ETAPS Foreword


Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital 
of Bavaria, in Germany. 

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and 
Practice of Software. ETAPS is an annual federated conference established in 1998, 
and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from theo- 
retical computer science to foundations of programming languages, analysis tools, and 
formal approaches to software engineering. Organizing these conferences in a coherent, 
highly synchronized conference program enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the 
weekend before the main conference, numerous satellite workshops took place that
attracted many researchers from all over the globe.

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, 
yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University 
College London, UK, and Cornell University, USA) and Tomas Vojnar (Brno 
University of Technology, Czech Republic) and the conference-specific invited 
speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck 
(University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by 
Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and 
Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated 
learning. 

As this event was the 25th edition of ETAPS, part of the program was a special 
celebration where we looked back on the achievements of ETAPS and its constituting 
conferences in the past, but we also looked into the future, and discussed the challenges 
ahead for research in software science. This edition also reinstated the ETAPS men- 
toring workshop for PhD students. 

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the 
Technical University of Munich (TUM) and the LMU Munich. The former was 
founded in 1868, and the latter in 1472 as the 6th oldest German university still running 
today. Together, they have 100,000 enrolled students, regularly rank among the top 
100 universities worldwide (with TUM’s computer-science department ranked #1 in 
the European Union), and their researchers and alumni include 60 Nobel laureates.

The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer
(general, financial, and workshop chair), Julia Eisentraut (organization chair), and 
Alexandros Evangelidis (local proceedings chair). 

ETAPS 2022 was further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology). 

The ETAPS Steering Committee consists of an Executive Board, and representa- 
tives of the individual ETAPS conferences, as well as representatives of EATCS, 
EAPLS, and EASST. The Executive Board consists of Holger Hermanns 
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofron (Prague), Barbara König 
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik 
and Tallinn), and Lenore Zuck (Chicago). 

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch 
Johnsen (Oslo), Dana Fisman (Be’er Sheva), Reiko Heckel (Leicester), Joost-Pieter 
Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna 
Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), 
Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Rosu (Illinois), 
Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella 
(Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina 
(Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastian Uchitel 
(London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), 
Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz). 

I'd like to take this opportunity to thank all authors, attendees, organizers of the
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all 
enjoyed ETAPS 2022. 

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their 
enormous efforts to make ETAPS a fantastic event. 


February 2022 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


TACAS 2022 was the 28th edition of the International Conference on Tools and 
Algorithms for the Construction and Analysis of Systems. TACAS 2022 was part of the 
25th European Joint Conferences on Theory and Practice of Software (ETAPS 2022), 
which was held from April 2 to April 7 in Munich, Germany, as well as online due to the 
COVID-19 pandemic. TACAS is a forum for researchers, developers, and users inter- 
ested in rigorous tools and algorithms for the construction and analysis of systems. The 
conference aims to bridge the gaps between different communities with this common 
interest and to support them in their quest to improve the utility, reliability, flexibility, 
and efficiency of tools and algorithms for building computer-controlled systems. 
There were four submission categories for TACAS 2022: 


1. Research papers advancing the theoretical foundations for the construction and
analysis of systems.
2. Case study papers with an emphasis on a real-world setting. 
3. Regular tool papers presenting a new tool, a new tool component, or novel 
extensions to an existing tool. 
4. Tool demonstration papers focusing on the usage aspects of tools. 


Papers of categories 1-3 were restricted to 16 pages, and papers of category 4 to six
pages.

This year 159 papers were submitted to TACAS, consisting of 112 research papers, 
five case study papers, 33 regular tool papers, and nine tool demo papers. Authors were 
allowed to submit up to four papers. Each paper was reviewed by three Program 
Committee (PC) members, who made use of subreviewers. Similarly to previous years, 
it was possible to submit an artifact alongside a paper, which was mandatory for regular 
tool and tool demo papers. 

An artifact might consist of a tool, models, proofs, or other data required for vali- 
dation of the results of the paper. The Artifact Evaluation Committee (AEC) was tasked 
with reviewing the artifacts based on their documentation, ease of use, and, most 
importantly, whether the results presented in the corresponding paper could be accu- 
rately reproduced. Most of the evaluation was carried out using a standardized virtual 
machine to ensure consistency of the results, except for those artifacts that had special 
hardware or software requirements. The evaluation consisted of two rounds. The first 
round was carried out in parallel with the work of the PC. The judgment of the AEC 
was communicated to the PC and weighed in their discussion. The second round took 
place after paper acceptance notifications were sent out; authors of accepted research 
papers who did not submit an artifact in the first round could submit their artifact at this 
time. In total, 86 artifacts were submitted (79 in the first round and seven in the second) 
and evaluated by the AEC regarding their availability, functionality, and/or reusability. 
Papers with an artifact that was successfully evaluated include one or more badges on 
the first page, certifying the respective properties.

Selected authors were requested to provide a rebuttal for both papers and artifacts in
case a review gave rise to questions. Using the review reports and rebuttals, the 
Program and the Artifact Evaluation Committees extensively discussed the papers and 
artifacts and ultimately decided to accept 33 research papers, one case study, 12 tool 
papers, and four tool demos. 

This corresponds to an acceptance rate of 29.46% for research papers and an overall 
acceptance rate of 31.44%. 

Besides the regular conference papers, this two-volume proceedings also contains 
16 short papers that describe the participating verification systems and a competition 
report presenting the results of the 11th SV-COMP, the competition on automatic 
software verifiers for C and Java programs. These papers were reviewed by a separate 
Program Committee (PC); each of the papers was assessed by at least three reviewers. 
A total of 47 verification systems with developers from 11 countries entered the
systematic comparative evaluation, including four submissions from industry. Two
sessions in the TACAS program were reserved for the presentation of the results: (1) a
summary by the competition chair and presentations of the participating tools by the
developer teams in the first session, and (2) an open community meeting in the second
session.

We would like to thank all the people who helped to make TACAS 2022 successful. 
First, we would like to thank the authors for submitting their papers to TACAS 2022. 
The PC members and additional reviewers did a great job in reviewing papers: they 
contributed informed and detailed reports and engaged in the PC discussions. We also 
thank the steering committee, and especially its chair, Joost-Pieter Katoen, for his 
valuable advice. Lastly, we would like to thank the overall organization team of 
ETAPS 2022.


Abstract. Logic locking “hides” the functionality of a digital circuit to
protect it from counterfeiting, piracy, and malicious design modifications.
The original design is transformed into a “locked” design such that the
circuit reveals its correct functionality only when it is “unlocked” with
a secret sequence of bits, the key bit-string. However, strong attacks,
especially the SAT attack that uses a SAT solver to recover the key bit- 
string, have been profoundly effective at breaking the locked circuit and 
recovering the circuit functionality. 

We lift logic locking to Higher Order Logic Locking (HOLL) by hiding a 
higher-order relation, instead of a key of independent values, challenging 
the attacker to discover this key relation to recreate the circuit func- 
tionality. Our technique uses program synthesis to construct the locked 
design and synthesize a corresponding key relation. HOLL has low overhead,
and existing attacks for logic locking do not apply as the entity to be
recovered is no longer a value. To evaluate our proposal, we propose a new
attack (SynthAttack) that uses an inductive synthesis algorithm guided 
by an operational circuit as an input-output oracle to recover the hidden 
functionality. SynthAttack is inspired by the SAT attack, and similar to 
the SAT attack, it is verifiably correct, i.e., if the correct functionality is 
revealed, a verification check guarantees the same. Our empirical analy- 
sis shows that SynthAttack can break HOLL for small circuits and small 
key relations, but it is ineffective for real-life designs. 


Keywords: Logic Locking · Program Synthesis · Hardware Security.


1 Introduction 


High manufacturing costs in advanced technology nodes are pushing many semi- 
conductor design houses to outsource the fabrication of integrated circuits (IC) 
to third-party foundries [26, 42]. A fab-less design house can increase the invest- 
ments in the chip’s intellectual property, while a single foundry can serve multiple 
companies. However, this globalization process introduces security threats in the 
supply chain [25]. A malicious employee of the foundry can access and reverse 
engineer the circuit design to make illegal copies. Logic locking [44] alters the 


© The Author(s) 2022


chip’s functionality to make it unusable by the foundry. This alteration depends
on a locking key that is re-inserted into the chip in a trusted facility, after fab- 
rication. The locking key is, thus, the “secret”, known only to the design house. 
Logic locking assumes that the attackers have no access to the key but they may 
have access to a functioning chip (obtained, for example, from the legal/illegal 
market). However, logic locking has witnessed several attacks that analyze the 
circuit and attempt key recovery [31, 43, 48, 59]. 

In this paper⁴, we combine the intuitions from logic locking, program synthesis,
and programmable devices to design a new locking mechanism. Our technique,
called higher order logic locking (HOLL), locks a design with a key relation
instead of a sequence of independent key bits. HOLL uses program synthesis
[3, 50] to translate the original design into a locked one. Our experiments
demonstrate that HOLL is fast, scalable, and robust against attacks. Prior at- 
tacks on logic locking, like the SAT attack [51], are not practical for HOLL. Since 
the functionality of the key-relation is completely missing, attackers cannot sim- 
ply make propositional logic queries to recover the key (like [43, 51, 18]). There 
are variants of logic locking, like TTLock [61] and SFLL [60], that attempt to 
combat SAT attacks [51]. However, these techniques use additional logic blocks
(comparison and restoration circuits), which make them prone to attacks via
structural and functional analysis on this additional circuitry [47]. HOLL is
resilient to such techniques as it exposes only a programmable logic that does not
leak any information related to the actual functionality to be implemented. 

In contrast to logic locking, attacking HOLL requires solving a second-order 
problem (see §8 for a detailed discussion on this). To assess the security of our 
method, we design a new attack, SynthAttack, by combining ideas from SAT at- 
tack [51] and inductive program synthesis [50]. SynthAttack employs a counter- 
example guided inductive synthesis (CEGIS) procedure guided via a functioning 
instance of the circuit as an input-output oracle. This attack constructs a syn- 
thesis query to discover key relations that invoke semantically distinct functional 
behaviors of the locked design—witnesses to such relations, referred to as distin- 
guishing inputs, act as counterexamples to drive inductive learning. When the 
locked design is verified to have unique functionality, the attack is declared suc- 
cessful, with the corresponding provably-correct key relation. 

Our experimental results (§6) show that the time required by an attacker
to recover the key relation for a given set of distinguishing inputs (attack time)
increases exponentially with the size of the key relation. While the attacker may
be able to recover key relations for small HOLL-locked circuits with small key
relations, larger circuits are robust to SynthAttack. For example, for the des
benchmark, the asymmetry between HOLL defense and SynthAttack is large;
while HOLL can lock this design in less than 100 seconds, the attack cannot
recover the design even within four days for a key relation that increases the area
overhead of the IC by only 1.2%. Further, the attack time required to unlock
the designs increases exponentially with the complexity of the key relation.

The key relation can be implemented with reconfigurable or programmable
devices, like programmable array logic (PAL) or embedded field-programmable
gate array (eFPGA). For example, eFPGA, essentially an IP core embedded into
an ASIC or SoC, is becoming common in modern SoCs [2] and has been shown
to have high resilience against bit-stream recovery [7].


4 An extended version [53] of this paper is also available.


(a) Original circuit:
    $t_0 = x_0 \wedge x_2$
    $t_1 = x_3 \wedge t_0$
    $t_2 = x_1 \wedge t_0$
    $y_0 = x_0 \oplus x_2$
    $y_2 = (x_1 \wedge x_3) \vee t_2 \vee t_1$
    $y_1 = t_0 \oplus x_1 \oplus x_3$

(b) Locked circuit:
    $t_0 = x_0 \wedge x_2$
    $t_1 = x_0 \wedge (r_4 \oplus r_2) \wedge x_3$
    $t_2 = x_0 \wedge r_3$
    $y_0 = x_0 \oplus x_2$
    $y_2 = (x_1 \wedge x_3) \vee t_2 \vee t_1$
    $y_1 = t_0 \oplus x_1 \oplus x_3$

(c) Key relation:
    $\{(r_0 \leftrightarrow x_1),\ (r_1 \leftrightarrow x_2),\ (r_2 \leftrightarrow rand),\ (r_3 \leftrightarrow r_0 \wedge r_1),\ (r_4 \leftrightarrow r_1 \oplus r_2)\}$

Fig. 1: HOLL on a 2-bit Adder.
Our contributions are:
- We propose a novel IP protection strategy, called higher order logic locking
  (HOLL), that uses program synthesis;
- To evaluate the security offered by HOLL, we propose a strong adversarial
  attack algorithm, SynthAttack;
- We build tools to apply HOLL and SynthAttack to combinational logic;
- We evaluate HOLL on cost, scalability, and robustness.


2 HOLL Overview 


2.1 Threat Model: the Untrusted Foundry 


We focus on the threat model where the attacker is in the foundry [44, 45] to 
which a fab-less design house has outsourced its IC fabrication. Such an attacker 
has access to the IC design and the (locked) GDSII netlist which can be reverse- 
engineered. Also, if the attacker can access a working IC (e.g., by procuring an IC 
from the open market or a discarded IC from the gray market), they can leverage 
the functional IC’s I/O behavior as a black-box oracle. However, we assume the 
attacker cannot extract the bitstream, i.e. the correct sequence of configuration 
bits, from the device. This can be achieved with encryption techniques when the 
bitstream is not loaded into the device. Also, anti-readback solutions can prevent 
the attacker from reading the bitstream from the device. The parameters used 
to synthesize the key relation and the locked circuit (like the domain-specific 
language (DSL), budget etc.) are not known to the attacker (see §8). 


2.2 Defending with HOLL 


Consider a hardware circuit $Y \leftrightarrow \varphi(X)$, where $X$ and $Y$ are the inputs and
outputs, respectively. HOLL uses a higher-order lock: a secret relation ($\psi$) among a
certain number of additional relation bits $R$. We refer to $\psi$ as the key relation.

Fig. 1a shows an example of a 2-bit adder with input $X$ ($\{x_1 x_0, x_3 x_2\}$) and
output $Y$ ($y_2 y_1 y_0$). The circuit is locked by transforming the original expressions


6 G. Takhar et al. 


(marked in blue) in Fig. 1a to the locked expressions (marked in red) in Fig. 1b.
The locked expressions use the additional relation bits $r_2$, $r_3$, and $r_4$, ensuring
that this locked design $\hat{\varphi}(X, R)$ functions correctly when the secret relation $\psi$
(Fig. 1c) is installed. The relation $\psi$ establishes the correct relation between the
relation bits $R$. The key relation can be excited by circuit inputs (as in $r_0$
and $r_1$), constants, or random bits (e.g., from system noise); for example,
the value rand in Fig. 1c represents the random generation of a bit (0 or 1)
assigned to $r_2$. The “outputs” of the key relation are the bits $r_3$ and $r_4$, which must
satisfy the relational constraints enforced by the key relation.

For the sake of simplicity, in the rest of the paper, we assume the relation
bits are drawn only from the inputs $X$ of the design. We will attempt to infer
key relations of the form $\psi(X, R)$. The reader may assume the value rand in
Fig. 1c to be a constant value (say 0) to ease the exposition.

As $\hat{\varphi}$ also consumes the relation bits $R$, HOLL transforms the original circuit
$Y \leftrightarrow \varphi(X)$ into a locked circuit $Y \leftrightarrow \hat{\varphi}(X, R)$ such that the locked circuit
functions correctly if the key relation $\psi(X, R)$ is satisfied. In other words, HOLL
is required to preserve the semantic equivalence between the original and locked
designs ($\varphi \equiv \hat{\varphi} \wedge \psi$). Note that it only imposes constraints on the input-output
functionality of the circuits, not on the generated values of internal gates. For
example, in Fig. 1b, the value of $t_1$ may be different from the one in the original
design (Fig. 1a), but the final output $y_2$ is equivalent to the original adder.

This approach has analogies with the well-known logic locking solution [10, 
37, 54]. Traditional logic locking produces a locked circuit by mutating certain 
expressions based on input key bits. HOLL differs from logic locking on the type 
of entities employed as hidden keys. While logic locking uses a key value (i.e., a 
sequence of key-bits), our technique uses a key relation (i.e., a functional relation 
among the key bits). As we attempt to hide a higher-order entity (relation),
we refer to our scheme as higher-order logic locking (HOLL). As a relation
(a second-order problem) is more challenging to recover than a bit-sequence
(a first-order problem), HOLL is, at least in theory, more secure
than logic locking. Our experimental results (§6) show that this security also
translates to practice.


Hardware constraints. Since the key relation must be implemented in the 
circuit, we need to consider practical constraints. For example, the size of the 
key relation affects the size of the programmable logic to be used for its imple- 
mentation. This, in turn, introduces area and delay overheads in the final circuit. 
The practical realizability of this technique adds certain hardware constraints: 
- The key relation must be small for it to have a small area overhead;
- The key relation must only be executed once to avoid a significant
  performance overhead;
- The key relation must encode non-trivial relations between the challenge and
  response bits to provide strong security;
- The locked expressions must be evenly distributed across the design to protect
  all parts of the circuit, disallowing focused attacks by an attacker on a small
  part of the circuit that contains all locks.


Inferring the key relation. HOLL operates by


1. carefully selecting a set of expressions, $E \subseteq \varphi$, in the original design $\varphi$;
2. mutating each expression $e_i \in E$ using the relation bits $R$ to create the
   corresponding locked expression, $\hat{e}_i$.


For example, in Fig. 1a, we select two expressions, $E = \{e_1, e_2\}$ where $e_1 = x_1 \wedge t_0$
and $e_2 = x_3 \wedge t_0$. Expression $e_1$ computes $t_2$ and is a function of $t_0$ and $x_1$, while
$\hat{e}_1$ uses $x_0$ and $r_3$, which is in turn a relation of $r_0$ and $r_1$. We formalize our lock
program synthesis problem as follows.


Lock Inference. Given a circuit $Y \leftrightarrow \varphi(X)$, construct a locked circuit
$Y \leftrightarrow \hat{\varphi}(X, R)$ and a key relation $\psi(X, R)$ such that $\hat{\varphi}$ is semantically equivalent to
$\varphi$ with the correct relation $\psi$. Specifically, it requires us to construct: (1) a key
relation $\psi$ and (2) a set of locked expressions $\hat{E}$ relating to the set of selected
expressions $E$ extracted from $\varphi$ such that the following conditions are met:


- Correctness: The circuit is guaranteed to work correctly for all inputs when
  the key relation is installed:

  $$\forall X.\ (\forall R.\ \psi(X, R) \Rightarrow (\hat{\varphi}(X, R) = \varphi(X))) \tag{1}$$

  where $\hat{\varphi} = \varphi[\hat{e}_1/e_1, \dots, \hat{e}_n/e_n]$, for $e_i \in E \subseteq \varphi$. The notation $\varphi[e_a/e_b]$ implies
  that $e_b$ is replaced by $e_a$ in the formula $\varphi$.

- Security: There must exist some relation $\psi'$ (where $\psi' \neq \psi$) for which the
  circuit exhibits incorrect behavior; in other words, it enforces the key relation
  to be non-trivial:

  $$\exists \psi'.\ \exists X.\ \exists R.\ (\psi'(X, R) \Rightarrow \hat{\varphi}(X, R) \neq \varphi(X)) \tag{2}$$


We pose the above as a program synthesis [40, 49] problem. In particular, we
search for “mutations” $\hat{e}_1, \dots, \hat{e}_n$ and a suitable key relation $\psi$ such that the
above constraints are satisfied.
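The two conditions can be checked by brute force on the 2-bit adder of Fig. 1, since it has only four inputs and five relation bits. The sketch below is illustrative and is not the authors' tooling: the function names (`phi`, `phi_hat`, `psi`, `psi_zero`, `violations`) are hypothetical, the circuits are transcribed from Fig. 1 as the surrounding text describes them, and `psi_zero` is an arbitrary wrong relation chosen here only to witness the security condition.

```python
from itertools import product


def phi(x):
    """Original adder (Fig. 1a); x = (x0, x1, x2, x3), returns (y2, y1, y0)."""
    x0, x1, x2, x3 = x
    t0 = x0 & x2
    t1 = x3 & t0
    t2 = x1 & t0
    return ((x1 & x3) | t2 | t1, t0 ^ x1 ^ x3, x0 ^ x2)


def phi_hat(x, r):
    """Locked adder (Fig. 1b); r = (r0, r1, r2, r3, r4)."""
    x0, x1, x2, x3 = x
    _, _, r2, r3, r4 = r
    t0 = x0 & x2
    t1 = x0 & (r4 ^ r2) & x3
    t2 = x0 & r3
    return ((x1 & x3) | t2 | t1, t0 ^ x1 ^ x3, x0 ^ x2)


def psi(x, r):
    """Key relation (Fig. 1c) as a predicate; r2 (rand) is left unconstrained."""
    _, x1, x2, _ = x
    r0, r1, r2, r3, r4 = r
    return r0 == x1 and r1 == x2 and r3 == (r0 & r1) and r4 == (r1 ^ r2)


def psi_zero(x, r):
    """Hypothetical wrong relation: every relation bit is forced to 0."""
    return r == (0, 0, 0, 0, 0)


def violations(relation):
    """All (X, R) that satisfy `relation` but make the locked circuit differ."""
    return [
        (x, r)
        for x in product((0, 1), repeat=4)
        for r in product((0, 1), repeat=5)
        if relation(x, r) and phi_hat(x, r) != phi(x)
    ]


if __name__ == "__main__":
    print("correct under psi:", not violations(psi))
    print("witnesses under psi_zero:", violations(psi_zero))
```

An empty result for `psi` corresponds to the correctness condition, and a non-empty result for some other relation corresponds to the security condition. Exhaustive enumeration is only feasible for toy circuits; the paper poses the same conditions as synthesis queries.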


2.3 Attacking with SynthAttack 


As we attempt to hide a relation instead of a key-value, prior attacks on logic 
locking (like SAT attacks), which attempt to infer key bit-strings, do not apply. 
However, the attackers can also use program synthesis techniques to recover the 
key relation using an activated instance of the circuit as an input-output oracle. 

We design an attack algorithm, called SynthAttack, combining ideas from
SAT attack (for logic locking) and counterexample guided inductive program
synthesis. Our attack algorithm generates inputs $X_1, X_2, \dots, X_n$ and computes
the corresponding outputs $Y_1, Y_2, \dots, Y_n$ using the oracle, to construct a set of
examples $\Lambda = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$. Then, the attacker can generate a key
relation $\psi$ that satisfies the above examples, $\Lambda$, using a program synthesis query:


$$\exists \psi.\ \bigwedge_{(X_i, Y_i) \in \Lambda} \hat{\varphi}(X_i, R_i) \wedge \psi(X_i, R_i) = Y_i \tag{3}$$


The above query requires copies of $\hat{\varphi}(X, R)$ for every example; hence, the
formula will quickly explode with an increasing number of samples. Our scheme is
robust since the sample complexity of the key relations increases exponentially
with the number of relation bits employed. Additionally, the attacker does
not know which input bits excite the key relation and how the relation bits are
related to each other.
For the locked adder (Fig. 1b) with the input samples shown in Table 1 (first
four rows), the above attack can synthesize the key relation shown in Fig. 2.
Columns $Y$ and $\hat{Y}$ in Table 1 represent the outputs of the original circuit and
the circuit obtained by the attacker, respectively. Even on a 4-bit input space, when
25% of all possible samples are provided, the attack fails to recover the key relation,
as shown by the last input row of Table 1. The red box highlights the output of the
attacker circuit that does not match the original design.

Table 1: Input samples.

Fig. 2: Generated key relation:
    $\{(r_0 \leftrightarrow x_2),\ (r_1 \leftrightarrow x_0),\ (r_2 \leftrightarrow 0),\ (r_3 \leftrightarrow r_0 \wedge r_1),\ (r_4 \leftrightarrow 0)\}$
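A brute-force comparison makes the failure in this example concrete. The sketch below assumes that the attacker's relation is the one shown in Fig. 2 (r0 tied to x2, r1 tied to x0, r2 and r4 fixed to 0, r3 = r0 AND r1) and that the circuits are those of Fig. 1; the helper names are hypothetical and the contents of Table 1 are not reproduced.

```python
from itertools import product


def original(x0, x1, x2, x3):
    """Original 2-bit adder (Fig. 1a); returns (y2, y1, y0)."""
    t0 = x0 & x2
    t1 = x3 & t0
    t2 = x1 & t0
    return ((x1 & x3) | t2 | t1, t0 ^ x1 ^ x3, x0 ^ x2)


def locked(x0, x1, x2, x3, r2, r3, r4):
    """Locked adder (Fig. 1b); t1 and t2 read the relation bits."""
    t0 = x0 & x2
    t1 = x0 & (r4 ^ r2) & x3
    t2 = x0 & r3
    return ((x1 & x3) | t2 | t1, t0 ^ x1 ^ x3, x0 ^ x2)


def attacker_relation(x0, x1, x2, x3):
    """Relation of Fig. 2 (as assumed here); returns (r2, r3, r4)."""
    r0, r1 = x2, x0
    return 0, r0 & r1, 0


def mismatches():
    """Inputs (x0, x1, x2, x3) on which the attacker's circuit is wrong."""
    return [
        x
        for x in product((0, 1), repeat=4)
        if locked(*x, *attacker_relation(*x)) != original(*x)
    ]


if __name__ == "__main__":
    print(mismatches())
```

Under these assumptions the attacker's circuit agrees with the original on 15 of the 16 inputs, so a small sample set can easily miss the single input (x0 = 1, x1 = 0, x2 = 1, x3 = 0) that exposes the wrong relation.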


Definition (Distinguishing Input). Given a locked circuit $\hat{\varphi}$, we refer to input $X$
as a distinguishing input if there exist candidate relations $\psi_1$ and $\psi_2$ that evoke
semantically distinct behaviors on the input $X$. Formally, $X$ is a distinguishing
input provided the following formula is satisfiable on some relations $\psi_1$ and $\psi_2$:


$$\hat{\varphi}(X, R_1) \neq \hat{\varphi}(X, R_2) \wedge \psi_1(X, R_1) \wedge \psi_2(X, R_2) \tag{4}$$


SynthAttack searches for a distinguishing input, $X_d$, that produces conflicting
outputs on the locked design. Any such distinguishing input is added to the set of
examples, $\Lambda$, and the query is repeated. If the query is unsatisfiable, it implies that
no other relation can produce a different behavior on the locked design, and so the
relation that satisfies the current set of examples must be a correct key relation.
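The loop described above can be sketched on the adder of Fig. 1 by replacing the synthesis solver with exhaustive enumeration. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate space in `candidate_relations` is a small hypothetical family (r2 fixed to 0, r3 either 0 or an AND of two inputs, r4 either 0 or a single input), relations are treated as deterministic functions of the input, and all names are invented for the example.

```python
from itertools import combinations, product

INPUTS = list(product((0, 1), repeat=4))  # all (x0, x1, x2, x3)


def oracle(x):
    """Activated chip (original adder, Fig. 1a) used as an I/O oracle."""
    x0, x1, x2, x3 = x
    t0 = x0 & x2
    return ((x1 & x3) | (x1 & t0) | (x3 & t0), t0 ^ x1 ^ x3, x0 ^ x2)


def locked(x, r):
    """Locked adder (Fig. 1b) driven by relation bits r = (r2, r3, r4)."""
    x0, x1, x2, x3 = x
    r2, r3, r4 = r
    t0 = x0 & x2
    t1 = x0 & (r4 ^ r2) & x3
    t2 = x0 & r3
    return ((x1 & x3) | t2 | t1, t0 ^ x1 ^ x3, x0 ^ x2)


def candidate_relations():
    """Hypothetical search space; each candidate maps x to (r2, r3, r4)."""
    r3_opts = [lambda x: 0] + [
        (lambda x, i=i, j=j: x[i] & x[j]) for i, j in combinations(range(4), 2)
    ]
    r4_opts = [lambda x: 0] + [(lambda x, i=i: x[i]) for i in range(4)]
    return [
        (lambda x, f3=f3, f4=f4: (0, f3(x), f4(x)))
        for f3 in r3_opts
        for f4 in r4_opts
    ]


def synth_attack(candidates):
    """Add distinguishing inputs until all surviving candidates agree."""
    examples = []
    consistent = list(candidates)
    while True:
        # A distinguishing input makes two surviving candidates disagree.
        distinguishing = next(
            (x for x in INPUTS if len({locked(x, c(x)) for c in consistent}) > 1),
            None,
        )
        if distinguishing is None:
            return consistent[0], examples
        y = oracle(distinguishing)
        examples.append((distinguishing, y))
        # Keep only the candidates that reproduce the oracle on this input.
        consistent = [
            c for c in consistent if locked(distinguishing, c(distinguishing)) == y
        ]


if __name__ == "__main__":
    relation, examples = synth_attack(candidate_relations())
    print("oracle queries used:", len(examples))
```

Each distinguishing input removes at least one candidate, so the loop terminates, and the relation it returns reproduces the oracle on every input as long as a correct relation lies in the candidate space. The enumeration over candidates is exactly the part that becomes infeasible as the key relation grows.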

Though SynthAttack significantly reduces the sample complexity of the at- 
tack, our experiments demonstrate that SynthAttack is still unsuccessful at 
breaking HOLL for larger designs. 


3 Program Synthesis to Infer Key Relations 


We represent the key relation $\psi$ as a propositional formula, represented as a set
(conjunction) of propositional terms. The terms in $\psi$ are categorised as:


- Stimulus terms: As mentioned in §2, a subset of the relation bits are
  related to input bits or constants; the stimulus terms appear as $(r_i \leftrightarrow x_j)$
  where $r_i \in R$, $x_j \in X \cup \{0, 1\}$.

- Latent terms: These clauses establish a relation among the relation bits;
  the variables $v$ in these terms are drawn from the relation bits $R$, i.e. $v \in R$.


5 All free variables are existentially quantified.

For example, in Fig. 1c the terms $(r_0 \leftrightarrow x_1)$, $(r_1 \leftrightarrow x_2)$, and $(r_2 \leftrightarrow rand)$ are
stimulus terms, while $(r_3 \leftrightarrow (r_0 \wedge r_1))$ and $(r_4 \leftrightarrow (r_1 \oplus r_2))$ are latent terms.


Budget. As the key relation may need to be implemented within a limited hard- 
ware budget, our synthesis imposes a hard threshold on its size. The threshold 
could directly capture the hardware constraints for implementing the key re- 
lation (e.g., the estimated number of gates or ports) or indirectly indicate the 
complexity of the key relation (e.g., number of relation bits, propositional terms, 
or latent terms). 
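The term categories and the indirect budget measures can be illustrated on the key relation of Fig. 1c. The encoding below is a hypothetical one chosen for this sketch (a dictionary from each relation bit to an operator and its operands), and the non-strict comparison in `within_budget` is an assumption; it is not the representation used by the authors' tool.

```python
# Key relation of Fig. 1c: relation bit -> (operator, operands).
KEY_RELATION = {
    "r0": ("id", ["x1"]),
    "r1": ("id", ["x2"]),
    "r2": ("id", ["rand"]),
    "r3": ("and", ["r0", "r1"]),
    "r4": ("xor", ["r1", "r2"]),
}


def classify(relation):
    """Label a term 'latent' if all of its operands are relation bits,
    and 'stimulus' otherwise (inputs, constants, or random bits)."""
    bits = set(relation)
    return {
        bit: "latent" if all(v in bits for v in operands) else "stimulus"
        for bit, (_, operands) in relation.items()
    }


def budget(relation):
    """Indirect complexity measures of the kind mentioned in the text."""
    kinds = classify(relation)
    return {
        "relation_bits": len(relation),
        "propositional_terms": len(relation),
        "latent_terms": sum(1 for kind in kinds.values() if kind == "latent"),
    }


def within_budget(relation, threshold, measure="latent_terms"):
    """Assumed check: the chosen measure must not exceed the threshold."""
    return budget(relation)[measure] <= threshold


if __name__ == "__main__":
    print(classify(KEY_RELATION))
    print(budget(KEY_RELATION))
```

For this relation the sketch reports three stimulus terms and two latent terms, matching the classification given in the text for Fig. 1c.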


3.1 Lock and Key Inference


Algorithm 1 outlines our algorithm for inferring the key relation and the locked
circuit. The algorithm accepts an unlocked design $Y \leftrightarrow \varphi(X)$ and a budget $T$ for
the key relation.


Algorithm 1: HOLL($\varphi$, $T$, $Q$)

     1  $\psi \leftarrow \emptyset$
     2  $\hat{\varphi} \leftarrow \varphi$
     3  done $\leftarrow$ False
     4  while not done do
     5      $E \leftarrow$ SelectExpr($\hat{\varphi}$)
     6      $H, \hat{E} \leftarrow$ Synthesize($\varphi$, $\hat{\varphi}$, $E$)
     7      $\psi' \leftarrow \psi \cup H$
     8      if Budget($\psi'$) < $T$ then
     9          $\psi \leftarrow \psi'$
    10          $\hat{\varphi} \leftarrow \hat{\varphi}[\hat{e}_i/e_i \mid$
    11              $e_i \in E, \hat{e}_i \in \hat{E}]$
    12      else
    13          $q \leftarrow$ CheckSec($\psi$, $\hat{\varphi}$)
    14          if $q$ then
    15              done $\leftarrow$ True
    16          else
    17              $\psi \leftarrow \emptyset$
    18              $\hat{\varphi} \leftarrow \varphi$
    19          end
    20      end
    21  end
    22  return $\hat{\varphi}$, $\psi$


Main Algorithm. The algorithm iterates, increasing the complexity of the
key relation, till the budget $T$ is reached (Lines 4-21). In every iteration, the
algorithm selects a set of suitable expressions $E$ for locking, uses our synthesis
procedure to extract a set of additional latent terms $H$ for the key relation,
and produces the mutated expressions $\hat{e}_i$ for each expression $e_i \in E$ (Line 6).
If the additional synthesized relations keep the key relation within the
budget $T$ (Line 8), the mutated expressions are replaced for $e_i \in E$ (Line 11).

HOLL verifies that the solution meets the two objectives of correctness and
security (§2). We handle correctness in the Synthesize procedure of Algorithm 1
and security in Lines 13-14 of the same algorithm. The CheckSec() procedure
verifies if the synthesized (locked) circuit and key relations satisfy the security
condition (Eqn. 2). If CheckSec() returns True, the key relation $\psi$ and the
locked circuit $\hat{\varphi}$ are returned; otherwise, synthesis is reattempted.


Correctness. HOLL attempts to synthesize (via the Synthesize procedure) a
key relation $\psi$ and a set of locked expressions $\hat{e}_i$ such that the circuit is
guaranteed to work correctly for all inputs given $\psi$; this requires us to satisfy:


$$\exists \psi, \hat{e}_1, \dots, \hat{e}_n.\ \forall X.\ \forall R.\ (\psi(X, R) \Rightarrow \hat{\varphi}(X, R) = \varphi(X)) \tag{5}$$


where $\hat{\varphi} = \varphi[\hat{e}_1/e_1, \dots, \hat{e}_n/e_n]$, for $e_i \in (E \subseteq \varphi)$. In other words, we attempt
to synthesize a set of modified expressions $\hat{E}$ that, once substituted for the selected
expressions in $E$, produces a circuit semantically equivalent to the original circuit
if the relation $\psi$ holds.


(a) Without optimization:
    $(r_0 \leftrightarrow x_1),\ (r_1 \leftrightarrow x_2),\ (r_2 \leftrightarrow x_0),\ (r_4 \leftrightarrow ((r_0 \wedge r_1) \wedge r_2)),\ (r_3 \leftrightarrow ((r_0 \wedge r_1) \vee r_2))$

(b) With opt. (terms recoverable from the figure):
    $(r_5 \leftrightarrow (r_0 \wedge r_1)),\ (r_4 \leftrightarrow (r_5 \wedge r_2))$

Fig. 4: Key relations generated without and with optimization.

Fig. 5: Dependency graph for the expressions in Fig. 1a.

We solve this synthesis problem via counterexample-guided inductive synthesis
(CEGIS) [3]. We provide a domain-specific language (DSL) in which $\psi$ and $\hat{e}_i$
are synthesized. CEGIS generates candidate solutions for the synthesis problem
and uses violations to the specification (i.e. the above constraint) to guide the
search for suitable programs $\psi$ and $\hat{e}_i$.

A problem with the above formulation is illustrated in Fig. 4: the key relation
in Fig. 4a uses 5 gates without reusing expressions, “wasting” hardware
resources. Fig. 4b shows an optimized key relation that reuses the response bit $r_5$,
allowing an implementation with only 3 gates. To encourage subexpression reuse,
we solve the following optimization problem where $\hat{\varphi} = \varphi[\hat{e}_1/e_1, \dots, \hat{e}_n/e_n]$, for
$e_i \in E \subseteq \varphi$:


$$\operatorname*{argmin}_{\mathrm{budget}(\psi)}\ \exists \psi, \hat{e}_1, \dots, \hat{e}_n.\ \forall X.\ (\forall R.\ \psi(X, R) \Rightarrow \hat{\varphi}(X, R) = \varphi(X)) \tag{6}$$

Security. The security objective requires that the locking (i.e., the key relation
$\psi$ and the locked expressions) is non-trivial; that is, there exists some relation
$\psi'$ ($\psi' \neq \psi$) for which the circuit is not semantically equivalent to the original
design:


$$\exists \psi',\ \psi' \neq \psi,\ \text{s.t.}\ \exists X.\ (\exists R.\ \psi'(X, R) \wedge \hat{\varphi}(X, R) \neq \varphi(X)) \tag{7}$$


The above constraint is difficult to establish while synthesizing W; it requires 
a search for a different relation ~’ that makes ô semantically distinct from y 
while ~ maintains semantic equivalence. Instead, we use a two-pronged strategy: 


– We carefully design the DSL used to synthesize $\psi$ and $\hat{e}_i$ to reduce the
possibility that they generate trivial relations;

– After obtaining $\psi$ and $\hat{\varphi}$, we attempt to synthesize an alternative relation
$\psi'$ (using Eqn. 8) such that $\hat{\varphi}$ is not semantically equivalent to $\varphi$, ensuring that
$\psi$ and $\hat{E}$ do not constitute a trivial locked circuit.

\[
\exists \psi'.\ \exists X, R'.\ \varphi(X) \neq \hat{\varphi}(X, R') \land \psi'(X, R') \tag{8}
\]

The procedure CheckSec($\psi$, $\hat{\varphi}$) (Algorithm 1, Line 13) implements the above
check (Eqn. 8).
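To make the two checks concrete, the following is a minimal Python sketch that tests the correctness constraint (the body of Eqn. 6) and searches for a security witness (Eqn. 8) by exhaustive enumeration. The toy circuit, the relations, and all function names are hypothetical, and brute force stands in for the synthesis engine used by HOLL.

```python
from itertools import product

BITS = (0, 1)

def phi(x):
    """Hypothetical original circuit: a single AND gate over two inputs."""
    x0, x1 = x
    return x0 & x1

def phi_hat(x, r):
    """Hypothetical locked circuit: the AND expression is replaced by r[0]."""
    return r[0]

def psi(x, r):
    """Intended key relation: r[0] must equal x0 AND x1."""
    return r[0] == (x[0] & x[1])

def psi_alt(x, r):
    """An alternative relation: r[0] must equal x0 OR x1."""
    return r[0] == (x[0] | x[1])

def is_correct(phi, phi_hat, psi, n_in, n_rel):
    """Check: for all X and R, psi(X, R) implies phi_hat(X, R) == phi(X)."""
    for x in product(BITS, repeat=n_in):
        for r in product(BITS, repeat=n_rel):
            if psi(x, r) and phi_hat(x, r) != phi(x):
                return False
    return True

def security_witness(phi, phi_hat, psi_alt, n_in, n_rel):
    """Search for (X, R') with psi_alt(X, R') and phi_hat(X, R') != phi(X)."""
    for x in product(BITS, repeat=n_in):
        for r in product(BITS, repeat=n_rel):
            if psi_alt(x, r) and phi_hat(x, r) != phi(x):
                return x, r
    return None

print(is_correct(phi, phi_hat, psi, 2, 1))
print(security_witness(phi, phi_hat, psi_alt, 2, 1))
```

In this toy instance the intended relation passes the correctness check, while the alternative relation yields a witness input on which the locked circuit deviates, so the locking is non-trivial in the sense of Eqn. 8.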


Theorem 1. If Algorithm 1 terminates, it returns a correct (Eqn. 1) and secure 
(Eqn. 2) locked design. 


Proof. The proof follows trivially from the design of the Synthesize (in particular, 
Eqn. 5) and CheckSec (in particular, Eqn. 8) procedures. 


3.2 Expression Selection 


HOLL constructs the dependency graph [19] $(V, D)$ for expression selection,
where nodes $V$ are circuit variables. A node $v \in V$ represents an expression $e$
such that $v$ is assigned the result of expression $e$, i.e. $(v \leftarrow e)$. The edges $D$ are
dependencies: the edge $v_1 \to v_2$ connects node $v_1$ to node $v_2$ if variable $v_1$
depends on variable $v_2$. The graph is rooted at the output variables and the input
variables appear as leaves.

For example, Fig. 5 shows the dependency graph for the circuit in Fig. 1a.
Triangles represent input ports ($x_0, x_1, x_2, x_3$) while inverted triangles represent
output ports ($y_0, y_1, y_2$).

Our variable selection algorithm has the following goals:


1. Ensure expression complexity: The algorithm selects an expression $e_z$
as a candidate for locking only if the depth of the corresponding variable
$z$ in the dependency graph lies in a user-defined range $[L, U]$, creating a
candidate set $Z$. The lower threshold $L$ ensures that the expression captures a
reasonably complex relation over the inputs, while the upper threshold $U$
ensures that the relation is not so complex that it exceeds the hardware budget. The
algorithm starts by randomly selecting a variable $z_0 \in Z$ from this set.

2. Encourage sub-expression reuse in key relation: We attempt to select
multiple “close” expressions; for this purpose, the algorithm randomly selects
variables $w_i \in Z$ on which $z_0$ (transitively) depends.

3. Encourage coverage: We select expressions for locking so as
to cover the circuit. To this end, interpreting $(V, D)$ as an undirected graph,
we randomly select expressions $u_i \in Z$ that are farthest from $z_0$, i.e. the
shortest distance between $u_i$ and $z_0$ is maximized.


Our algorithm first executes step (1), and then nondeterministically alternates
between (2) and (3) until the required number of variables is selected. Let us use
the dependency graph in Fig. 5 to show how the above algorithm operates:


– Given the user-defined range [1, 3], we compose the initial candidate set
$Z = \{v_0, t_0, t_1, t_2, y_1, y_2\}$.

– Let us assume we randomly pick the expression for $y_2$. Now, $y_2$ depends on
expressions $t_0$, $t_1$ and $t_2$ ($\{t_0, t_1, t_2\} \subseteq Z$) [Rule 1].

– We randomly choose new expressions to lock/transform from $\{t_0, t_1, t_2\}$. For
example, we select $t_2$ and $t_0$ [Rule 2].

– We find $y_1$, which is the furthest expression from $t_0$, $t_2$, $y_2$ in $Z$. We select to
lock the set of expressions $\{y_1, y_2, t_0, t_2\}$ [Rule 3].
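The three selection goals can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not HOLL's implementation: the dependency graph `DEPS` is hypothetical (it is not the graph of Fig. 5), depth is taken as the longest distance from the inputs, the alternation between goals (2) and (3) is a coin flip, and distance for goal (3) is measured to the set of already selected variables, as in the worked example above.

```python
import random
from collections import deque

# Hypothetical dependency graph: variable -> variables it depends on.
DEPS = {
    "x0": [], "x1": [], "x2": [], "x3": [],
    "v0": ["x0", "x1"],
    "t0": ["v0", "x2"],
    "t1": ["x2", "x3"],
    "t2": ["t0", "x1"],
    "y0": ["t1"],
    "y1": ["t1", "x3"],
    "y2": ["t0", "t1", "t2"],
}

def depths(deps):
    """Depth is 0 for inputs, else 1 + the maximum depth of the dependencies."""
    memo = {}
    def depth(v):
        if v not in memo:
            memo[v] = 1 + max(map(depth, deps[v])) if deps[v] else 0
        return memo[v]
    for v in deps:
        depth(v)
    return memo

def transitive_deps(deps, v):
    """All variables that v (transitively) depends on."""
    seen, stack = set(), list(deps[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(deps[u])
    return seen

def undirected_distances(deps, sources):
    """Multi-source BFS over the graph with edge directions ignored."""
    adj = {v: set() for v in deps}
    for v, ds in deps.items():
        for u in ds:
            adj[v].add(u)
            adj[u].add(v)
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def select_expressions(deps, lo, hi, count, rng):
    d = depths(deps)
    candidates = sorted(v for v in deps if lo <= d[v] <= hi)  # candidate set Z
    z0 = rng.choice(candidates)                                # goal 1
    selected = [z0]
    close = transitive_deps(deps, z0)
    while len(selected) < count:
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        near = [v for v in remaining if v in close]
        if near and rng.random() < 0.5:                        # goal 2: reuse
            selected.append(rng.choice(near))
        else:                                                  # goal 3: coverage
            dist = undirected_distances(deps, selected)
            far = max(dist.get(v, float("inf")) for v in remaining)
            pool = [v for v in remaining if dist.get(v, float("inf")) == far]
            selected.append(rng.choice(pool))
    return selected

print(select_expressions(DEPS, 1, 3, 4, random.Random(0)))
```

Passing a seeded `random.Random` keeps runs reproducible, which is convenient when comparing selection heuristics.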


4 HOLL: Implementation and Optimization 


Implementation. We implemented HOLL in Python, using the SKETCH [49]
synthesis engine to solve the program synthesis queries. We used BERKELEY-ABC [8]
to convert the benchmarks into Verilog and PYVERILOG [52], a Python-based
library, to parse the Verilog code and generate input for SKETCH. We use
SKETCH's support for optimizing over synthesis queries to solve
Eqn. 6. Algorithm 1 may not terminate; our implementation uses a timeout to
ensure termination.


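Domain Specific Language. We specify a domain-specific language for
synthesizing our key relations and locked expressions. The grammar is specified
as generators in the SKETCH [49] language. The grammar for the key relations
and locked expressions is as follows:

\[
\begin{aligned}
\langle G \rangle &::= r \leftarrow x \;\mid\; r \leftarrow r\,\langle \mathit{Bop} \rangle\, r \;\mid\; r \leftarrow \langle \mathit{Uop} \rangle\, r \;\mid\; r \leftarrow r \\
\langle \mathit{Bop} \rangle &::= \texttt{or} \;\mid\; \texttt{and} \;\mid\; \texttt{xor} \\
\langle \mathit{Uop} \rangle &::= \texttt{not}
\end{aligned}
\]

The rule $\langle G \rangle ::= r \leftarrow x$ is only present in the key relation grammar since the
locked expressions have no input bits.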
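As an illustration of what programs in this grammar look like, here is a small Python sketch of an evaluator for key relations. The tuple encoding of rules and the names `eval_relation` and `RULES` are assumptions made for this example; HOLL itself expresses the grammar as SKETCH generators. Rules are assumed to be listed so that every relation bit is defined before it is used.

```python
BINARY_OPS = {
    "or": lambda a, b: a | b,
    "and": lambda a, b: a & b,
    "xor": lambda a, b: a ^ b,
}

def eval_relation(rules, inputs):
    """Evaluate rules in order; each rule defines one relation bit."""
    r = {}
    for target, op, args in rules:
        if op == "input":      # r <- x (key relation grammar only)
            r[target] = inputs[args[0]]
        elif op == "copy":     # r <- r
            r[target] = r[args[0]]
        elif op == "not":      # r <- <Uop> r
            r[target] = 1 - r[args[0]]
        else:                  # r <- r <Bop> r
            r[target] = BINARY_OPS[op](r[args[0]], r[args[1]])
    return r

# Hypothetical key relation with five relation terms.
RULES = [
    ("r0", "input", ("x0",)),
    ("r1", "input", ("x1",)),
    ("r2", "and", ("r0", "r1")),
    ("r3", "not", ("r2",)),
    ("r4", "xor", ("r2", "r3")),
]

print(eval_relation(RULES, {"x0": 1, "x1": 0}))
```

In this encoding, the number of rules plays the role of the budget on relation terms.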


Backslicing. To improve scalability, we use backslicing [55] to extract the portion
of the design related to the expressions selected for locking. For a variable
$v_i$, the set of all transitive dependencies that can affect the value of $v_i$ is referred
to as its backslice. For example, in Fig. 5, $\mathit{backslice}(t_2) = \{t_0, v_0, x_1, x_2\}$.

Given the set of expressions $E$, we compute the union of the backslices of the
variables in $E$, i.e. all expressions $B$ in the subgraph induced by $e \in E$ in the
dependency graph; we use $B \subseteq E$ for lock synthesis.
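A backslice is a reachability computation over the dependency graph, which the following Python sketch illustrates. The graph fragment `DEPS` is hypothetical: its edges are chosen only so that the backslice of `t2` matches the set quoted above, and they are not taken from Fig. 5.

```python
# Hypothetical fragment: variable -> variables it depends on.
DEPS = {
    "x1": [], "x2": [],
    "v0": ["x1", "x2"],
    "t0": ["v0", "x1"],
    "t2": ["t0", "x2"],
}

def backslice(deps, v):
    """All transitive dependencies that can affect the value of v."""
    seen, stack = set(), list(deps.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(deps.get(u, ()))
    return seen

def backslice_union(deps, selected):
    """Union of the backslices of all variables selected for locking."""
    result = set()
    for v in selected:
        result |= backslice(deps, v)
    return result

print(sorted(backslice(DEPS, "t2")))
print(sorted(backslice_union(DEPS, ["t0", "t2"])))
```

Only the variables in the union need to be handed to the synthesizer, which is what keeps the queries small.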

Backslicing tilts the asymmetrical advantage further towards the HOLL de- 
fense. The attacker cannot apply backslicing on the locked design since the de- 
pendencies are obscured, preventing the extraction of the dependency graph. 


Incremental Synthesis. Given a set of expressions $E$, the procedure Synthesize
in Algorithm 1 creates a list of relations H and a new set of locked expressions $\hat{E}$.
If the list of expressions is large, we select the expressions in increasing order
of their depth in the dependency graph. The lower the depth of an expression,
the closer it is to the inputs and the simpler the expression. Selecting the
expression with the lowest depth first (say $e_1$) ensures that other expressions ($e_i$)
depending on this expression can use the relations H generated during the synthesis
of $\hat{e}_1$. This also makes synthesizing the other expressions easier, as the current
relation has some sub-expressions on which the new relations can be built.


5 SynthAttack: Attacking HOLL with Program Synthesis 


As HOLL requires inference of relations and not values, existing attacks designed 
for logic locking do not apply. We design a new attack, SynthAttack, that is 
inspired by the SAT attack [51] for logic locking and counterexample-guided 
inductive program synthesis (CEGIS) [50]. 


5.1 The SynthAttack Algorithm 


SynthAttack runs a CEGIS loop: it accumulates a set of examples, A. These
examples are used to constrain the space of candidate key-relations.
SynthAttack then uses a verification check to confirm whether the collected
examples are sufficient to synthesize a valid key-relation. Otherwise, the
counterexample from the failed verification check is identified as a
distinguishing input (§2) to be added to A, and the algorithm repeats.

Algorithm 2: SynthAttack($\hat{\varphi}$, $[\hat{\varphi}(\psi)]$)
 1  $i \leftarrow 0$
 2  $Q_0 \leftarrow \top$
 3  while True do
 4      $X' \leftarrow \mathrm{Solve}_X(Q_i$
 5            $\land\ (\hat{\varphi}(X, R_1) \neq \hat{\varphi}(X, R_2))$
 6            $\land\ \psi_1(X, R_1) \land \psi_2(X, R_2))$
 7      if $X' = \bot$ then
 8          break
 9      end
10      $Y' \leftarrow \varphi(X')$
11      $Q_{i+1} \leftarrow Q_i \land (\hat{\varphi}(X', R_1^i) = Y')$
12            $\land\ (\hat{\varphi}(X', R_2^i) = Y')$
13            $\land\ \psi_1(X', R_1^i) \land \psi_2(X', R_2^i)$
14      $i \leftarrow i + 1$
15  end
16  $\psi_1, \psi_2 \leftarrow \mathrm{Solve}_{\psi_1, \psi_2}(Q_i)$
17  return $\psi_1$

If there does not exist any distinguishing input for the locked circuit $\hat{\varphi}$,
then $\hat{\varphi}$ has a unique semantic behavior, and that must be the correct functionality. Any
key-relation that satisfies the counterexamples (distinguishing inputs) generated
so far will be a valid candidate for the key relation. An inductive synthesis
strategy based on distinguishing inputs allows us to quickly converge on a valid
realization of the key-relation, as each distinguishing input disqualifies many
potential candidates for the key relation. Note that (as we illustrate in the following
example) there may be multiple, possibly semantically dissimilar, realizations of
a key-relation that enable the same (correct) functionality on the locked circuit.
SynthAttack is outlined in Algorithm 2: the algorithm accepts the design of
the locked circuit ($\hat{\varphi}$) and an activated circuit ($\varphi$) (the locked circuit $\hat{\varphi}$ activated
with a valid key-relation $\psi$). The notation $[\hat{\varphi}(\psi)]$ indicates that this activated
circuit can only be used as an input-output oracle, but cannot be inspected.

SynthAttack runs a counterexample-guided synthesis loop (Line 3). It checks
for the existence of a distinguishing input in Line 4: if no such distinguishing
input exists, the current set of examples is sufficient to synthesize
a valid key-relation. In this case, the algorithm breaks out of the loop (Lines 7-8)
and synthesizes a key-relation (Line 16), which is returned as the synthesized,
provably-correct instance of the key relation.

If there exists a distinguishing input $X'$ (in Line 4), the algorithm uses the
activated circuit to compute the expected output $Y'$ corresponding to $X'$ (Line 10).
This new counterexample $(X', Y')$ is used to block all candidate key-relations
that lead to an incorrect behavior, thereby reducing the potential choices for $\psi_1$
and $\psi_2$. The loop continues, again checking for the existence of distinguishing
inputs on the updated constraint $Q_i$.

The theoretical analysis of SynthAttack is available in the extended
version [53]. The algorithm only terminates when it is able to synthesize a provably
valid key-relation, which allows us to state the following result.


Theorem 2. Algorithm 2 will always terminate, returning a key-relation $\psi_1$
such that $\hat{\varphi}(\psi_1)$ is semantically equivalent to $\hat{\varphi}(\psi)$, where $\psi$ is the “correct”
relation hidden by HOLL. (The proof is available in the extended version [53].)
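The loop of Algorithm 2 can be illustrated on a toy instance. The Python sketch below is not the paper's implementation: it replaces both solver calls with exhaustive enumeration over a small candidate space, and the locked circuit, hidden key relation, and all names are hypothetical.

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))

def locked(x, r0):
    """Hypothetical locked circuit with a single relation bit r0."""
    return r0 & x[0]

def hidden_key(x):
    """Hypothetical key relation hidden by the defender: r0 = x0 XOR x1."""
    return x[0] ^ x[1]

def oracle(x):
    """Activated circuit, usable only as an input-output oracle."""
    return locked(x, hidden_key(x))

# Candidate key relations: every truth table mapping an input to r0.
CANDIDATES = [dict(zip(INPUTS, bits))
              for bits in product((0, 1), repeat=len(INPUTS))]

def consistent(cand, examples):
    """True if the candidate reproduces every recorded (input, output) pair."""
    return all(locked(x, cand[x]) == y for x, y in examples)

def find_distinguishing_input(live):
    """An input on which two surviving candidates give different outputs."""
    for x in INPUTS:
        if len({locked(x, c[x]) for c in live}) > 1:
            return x
    return None

def synth_attack():
    examples, live = [], CANDIDATES
    while True:
        x = find_distinguishing_input(live)
        if x is None:          # no distinguishing input: any survivor is valid
            break
        examples.append((x, oracle(x)))
        live = [c for c in live if consistent(c, examples)]
    return live[0], examples, live

key, examples, live = synth_attack()
print(len(examples), len(live))
```

In this toy run the recovered relation activates the circuit correctly on every input yet differs from the hidden relation, because the locked circuit ignores the relation bit whenever x0 = 0. This mirrors the remark that several semantically dissimilar key relations can enable the same functionality.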


Example. SynthAttack on Fig. 1b iteratively generates six distinguishing inputs
(Table 2). The key relation synthesized by SynthAttack (Fig. 6) is not
semantically equivalent to the key-relation that was computed and hidden by
HOLL (Fig. 1c). This shows that there may exist multiple valid candidates for
the key-relation that all evoke the same functionality on the locked design. For
example, $X = 0100$ generates $r_4 = 1$ for the key relation in Fig. 1c but $r_4 = 0$
for Fig. 6; however, the output of the locked circuit remains the same in both
cases ($Y = 001$).

Fig. 6: Key relation generated by SynthAttack: {…, ($r_3 \leftarrow r_2 \land r_1$), ($r_4 \leftarrow \lnot r_2 \land r_1$)}.

Table 2: Distinguishing inputs.

  X    | Y
  -----+----
  …    | …
  1101 | 100
  …    | …
  0101 | 010
  0111 | 100
  1001 | 011


6 Experimental Evaluation 


We selected 100 combinational benchmarks from ISCAS’85 [1] and MCNC [58]
and report the time for program synthesis and the overhead after applying our
locking method. For long-running experiments, we use a subset of 10 randomly
selected benchmarks in which the number of input ports ranges between 16 and 256,
the number of output ports ranges between 7 and 245, and the number of AND gates
lies in the range [135, 4174].

For our experiments, we use the number of relation terms as the budget, in the range
[12, 14] for the key relation, and a depth of expression selection in the range [2, 4].
We conduct our experiments on a machine with a 32-core Intel(R) Xeon(R) Silver
4108 CPU @ 1.80GHz and 32GB RAM.

For both HOLL and SynthAttack, we use the SKETCH synthesis tool. Since
synthesis solvers are difficult to compare across different problem instances, we
were wary of the case where the defender gets an edge over the attacker due
to the use of different tools. We create the attack-team-defence-team asymmetry by
controlling the computation time: while the defender gets 20 minutes (1200s) to
generate the locked circuit, the attacker runs the attack for up to 4 days.

Our experiments aim to answer five research questions: 


RQ1. What is the attack resilience of HOLL? (§6.1)

RQ2. How do expression selection heuristics affect attack resilience? (§6.2)

RQ3. What is the hardware cost of HOLL? (§6.3)

RQ4. What is the time taken to synthesize the locked design and key-relation
for HOLL? (refer to the extended version [53])

RQ5. What is the impact of the optimizations for scalability (backslicing and
incremental synthesis)? (refer to the extended version [53])


Here is a summary of our findings: 


Security. The key relations can be recovered completely by the attacker 
via SynthAttack but only for small circuits with a small hardware budget. 
For medium and large designs, key relations are fast to obtain (<1200s) 
but cannot be recovered by our attack even within 4 days. This shows our 
defense is efficient while our attack is strong but not scalable. 


Hardware Cost. Our key relations with a budget of 12-14 latent terms 
have a minimal impact on the designs and the overhead reduces as the size 
of the circuit grows. On the largest benchmark, the area overhead is 1.2%. 
The corresponding configurations for programmable devices are small and 
provide high security. 


HOLL Performance. The HOLL execution time ranges between 8s and 
1001s, with an average of 33s for small, 17s for medium, and 60s for large 
designs for the budget of 8-10 latent terms. Our optimizations are crucial 
for the scalability of our HOLL defense (locking) algorithm: we fail to lock 
enough expressions in large circuits without these optimizations. 


6.1 Attack Resilience 


We define the attack resilience of a locked circuit, $\hat{\varphi}$, in terms of the time taken to obtain
a key relation, $\psi'$, such that $\hat{\varphi} \land \psi'$ is equivalent to the original circuit, $\varphi$.


Attack time. Fig. 8 shows the cumulative time spent until the $i$-th iteration
of the loop (y-axis) versus the loop counter $i$, which is also the number of
distinguishing inputs (samples) generated so far (x-axis). The data points are
plotted as blue dots, and exponential trend curves (solid pink lines) capture the
trend in the plotted points. The plotted points
follow the exponential trend lines, illustrating that SynthAttack does not scale
well, thereby asserting the resilience of HOLL.
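One common way to obtain such a trend curve is a least-squares fit on the logarithm of the measured times. The Python sketch below shows that approach; the source does not state how its trend lines were computed, so this is an assumed method, and the data are synthetic values generated for the example, not measurements from the experiments.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(b * x), done linearly on log(y)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxy / sxx                      # growth rate per iteration
    a = math.exp(mean_l - b * mean_x)  # scale factor
    return a, b

# Synthetic cumulative times (illustrative only, not data from the paper).
iterations = list(range(1, 11))
times = [2.0 * math.exp(0.3 * i) for i in iterations]

a, b = fit_exponential(iterations, times)
print(round(a, 3), round(b, 3))
```

A positive fitted growth rate `b` that describes the points well is what supports reading the cumulative attack time as exponential in the number of distinguishing inputs.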

SynthAttack failed to construct a valid key-relation for any of these ten designs
within a timeout of 4 days. However, for small designs with fewer
latent terms, SynthAttack was able to construct a valid key-relation (Fig. 10).

Fig. 8: Cumulative time for successive iterations of SynthAttack (best viewed in color).

Fig. 10: Attack time vs. #latent terms for i9 and al2.


Attack resilience vs. number of latent terms. The complexity of the key
relation increases with the number of relation bits. As shown in Fig. 10 (for
benchmarks al2 and i9), the time required to break the locked circuit increases
exponentially as the number of relation bits increases. We set a timeout of 10
hours for this experiment; al2 timed out at 9 latent terms, and i9 timed
out at 8 latent terms. Both results are for locked circuits with variables selected
with the depth of the locked expression, $\hat{e}_i$, equal to 1.


6.2 Impact of Expression Selection on Attack Resilience 


Attack resilience vs. Depth of locked expression. The attack resilience
of $\hat{\varphi}$ increases significantly as we increase the depth of the locked expressions
selected by HOLL for $\varphi$. We observe that, with the number of latent terms in the key
relation equal to 2, the attack time for benchmark al2 increases from 213s to 3788s
for depth 1 and 2, respectively. For benchmark i9, attack time increases from 351s
to 1141s for depth 1 and 2, respectively.


Attack time vs. Coverage. To show the effect of coverage, we select expressions
($e_i \in E$) such that the distance (§3.2) among the expressions is largest
(termed diverse) and smallest (termed converged). The attack time to
break the locked circuit is higher for the diverse than for the converged expression
selection heuristic. For example, for benchmarks C432 and i9, attack time increases from
115s to 142s and from 229s to 316s, respectively, when the expression selection heuristic
is changed from converged to diverse. The results are with three latent terms.


6.3 Hardware cost 


The key relations can be implemented either as embedded Field Programmable
Gate Array (eFPGA) or Programmable Array Logic. We synthesize the original
and locked designs with SYNOPSYS DESIGN COMPILER R-2020.09-SP1 targeting
the Nangate 15nm ASIC technology at standard operating conditions (25°C).

Table 3 provides the estimated cost for implementing the key relations with
programmable devices. To do so, we compute the number of equivalent NAND2 gates
used to estimate the number of 6-input LUTs. Given the number of LUTs, we give an
estimation of the equivalent number of configuration bits (see [53] for details),
including those for switch elements. Results show that the size of the key
relations is independent of the original design size. Table 3 reports the fraction
of the area locked with HOLL (key relation) to the area of the original circuit. The
results show that the impact of HOLL is low, mainly for large designs.

Table 3: Hardware Impact of HOLL.

        | Orig.      | Key Relation                        | Overhead
Bench   | Area (µm²) | Area (µm²) | #Eq. LUT | #Cfg. bits | Area (%)
--------+------------+------------+----------+------------+---------
al2     |      17.89 |      4.473 |      138 |      8,832 |     25.0
cht     |      20.74 |      4.178 |      132 |      8,448 |     20.1
caga    |      90.05 |      4.866 |      150 |      9,600 |     24.3
csgo    |      50.04 |      4.325 |      132 |      8,448 |      8.6
i9      |      77.07 |      4.129 |      126 |      8,064 |      5.4
i7      |      80.41 |      4.129 |      126 |      8,064 |      4.3
x3      |      95.21 |      5.014 |      156 |      9,984 |      5.3
frg2    |     100.81 |      4.669 |      144 |      9,216 |      4.6
i8      |     120.37 |      4.325 |      132 |      8,448 |      3.6
des     |     445.37 |      5.554 |      174 |     11,136 |      1.2
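As a quick check of how the overhead column relates to the two area columns, assuming the overhead is the ratio of the key-relation area to the original area as stated above, the al2 and des rows of Table 3 work out as follows (the symbols $A_{\text{key}}$ and $A_{\text{orig}}$ are introduced here only for illustration):

```latex
\[
\text{overhead} = \frac{A_{\text{key}}}{A_{\text{orig}}} \times 100\%,
\qquad
\text{al2: } \frac{4.473}{17.89} \times 100\% \approx 25.0\%,
\qquad
\text{des: } \frac{5.554}{445.37} \times 100\% \approx 1.2\%.
\]
```

Since the key-relation area stays within a narrow band of about 4 to 6 µm² across the benchmarks, the ratio shrinks as the original design grows.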


7 Related Work 


Logic Locking: Attacks and Defenses. Existing logic locking methods aptly 
operate on the gate-level netlists [54]. Gate-level locking cannot obfuscate all the 
semantic information because logic synthesis and optimizations absorb many of 
them into the netlist before the locking step. For example, constant propagation 
absorbs the constants into the netlist. Recently, alternative high-level locking 
methods obfuscate the semantic information before logic optimizations embed 
them into the netlist [37, 17]. For example, TAO applies obfuscations during 
HLS [37] but requires access to the HLS source code to integrate the obfuscations 
and cannot obfuscate existing IPs. Protecting a design at the register-transfer 
level (RTL) is an interesting compromise [29, 10]. Most of the semantic informa- 
tion (e.g., constants, operations, and control flows) is still present in the RTL 
and obfuscations can be applied to existing RTL IPs. In [29], the authors pro- 
pose structural and functional obfuscation for DSP circuits. In [10], the authors 
propose a method to insert a special finite state machine to control the tran- 
sition between obfuscated mode (incorrect function) and normal mode (correct 
function). Such transitions can only happen with a specific input sequence. Dif- 
ferently from [13], we extract the relation directly from the analysis of a single 
RTL design, making the approach independent of the design flow. None of these 
methods consider the possibility of hiding a relation among the key bits. 


Program Synthesis. Program synthesis has been successful in many domains:
synthesis of heap manipulations [39, 20, 57], bit-manipulating programs [27],
bug synthesis [40], parser synthesis [30, 46], regression-free repairs [6, 5],
synchronization in concurrent programs [56], Boolean functions [22, 24, 23] and even
differentially private mechanisms [38]. There has also been an interest in using
program synthesis in hardware designs [16]. VeriSketch [4] exploits the power
of program synthesis in hardware design. Our work is orthogonal to the objectives
and techniques of VeriSketch: while VeriSketch secures hardware against
timing attacks, we propose a hardware locking mechanism. Zhang et al. [62] use
SyGuS-based program synthesis to infer environmental invariants for verifying
hardware circuits. We believe that this work shows the potential of applying
programming-language techniques in hardware design. We believe that there is also
potential in applying program analysis techniques, symbolic [9, 21, 12, 36, 34],
dynamic [41, 14] and statistical [28, 32, 11, 33], to hardware analysis; this is a
direction we intend to pursue in the future.


8 Discussion 


We end the paper with an important clarification: the eFPGA configuration in
HOLL can also be represented as a bit sequence (i.e., a sequence of configuration
bits). So, why can an attacker not launch attacks similar to SAT attacks on logic
locking to recover the HOLL configuration bitstream?

The foremost reason is that while the key bits in traditional logic locking
simply represent a value that the attacker attempts to recover, the bit sequence
in HOLL is an encoding of a program [15, 35]. This raw bit sequence used to
program an eFPGA is too “low-level” to be synthesized directly: the size of such
bitstreams is about 60-85 times that of the keys used in traditional logic locking
(128-bit key sequences). So, the HOLL algorithm designer uses a higher-level
domain-specific language (DSL) to synthesize the key relation (see §4), which is later
“compiled” to the configuration sequence. The attacker will also have to use a
similar strategy of using a high-level DSL to break HOLL.

However, while the designer of the key relation can use a well-designed small
domain-specific language (DSL) that includes the exact set of components
required (and a controlled budget) to synthesize the key relation, the attacker,
not aware of the key relation or the DSL, will have to launch the attack with
a “guess” of a large overapproximation. In other words, the domain-specific
language used for synthesis is also a secret, thereby making HOLL much
harder to crack than traditional logic locking.

We evaluate HOLL (§6.1) under the assumption that the DSL (and budget)
are known to the attacker. In real deployments (when the DSL is not known to
the attacker), HOLL will be still more difficult to crack.


Abstract. In rational synthesis, we automatically construct a reactive
system that satisfies its specification in all rational environments, namely 
environments that have objectives and act to fulfill them. We complete 
the study of the complexity of LTL rational synthesis. Our contribution 
is threefold. First, we tighten the known upper bounds for settings that 
were left open in earlier work. Second, our complexity analysis is para- 
metric, and we describe tight upper and lower bounds in each of the 
problem parameters: the game graph, the objectives of the system com- 
ponents, and the objectives of the environment components. Third, we 
generalize the definition of rational synthesis, combining the cooperative 
and non-cooperative approaches studied in earlier work, and extend our 
complexity analysis to the general definition. 


1 Introduction 


Synthesis is the automated construction of a system from its specification. The 
basic idea is simple and appealing: instead of developing a system and verifying 
that it adheres to its specification, we use an automated procedure that, given a 
specification, constructs a system that is correct by construction, thus enabling 
the designers to focus on what the system should do rather than how to do it. A 
reactive system interacts with its environment and should satisfy its specifica- 
tion in all environments [8,25]. Accordingly, synthesis corresponds to a zero-sum 
game between the system and the environment, where they together generate a 
computation, the system wins if the computation satisfies the specification, and 
otherwise, the environment wins. 

In practice, the requirement to satisfy the specification in all environments 
is often too strong. Therefore, it is common to add assumptions on the envi- 
ronment. An assumption may be direct, say a specification that restricts the 
possible behaviors of the environment [5], or less direct, say a bound on the size 
of the environment or other resources it uses [14]. In [11], the authors suggest 
a conceptual assumption on the environment, namely its rationality: Rational 
synthesis is based on the idea that the components composing the environment 
typically have objectives of their own, and they act to achieve their objectives. 
For example, clients interacting with a server typically have objectives other
than to fail the server. As shown in [11], the system can capitalize on the
rationality and objectives of components that compose its environment. Adding
rationality into the picture makes the corresponding game non-zero-sum [22], 
thus objectives of different players may overlap. 


The interesting questions about non-zero-sum games concern stable out- 
comes, in particular Nash equilibria (NE) [21]. More formally, each of the players 
in the game has a strategy that directs her which actions to take; a profile is a 
vector of strategies, one for each player; each profile has an outcome (in our case, 
the computation generated when the system and the environment follow their 
strategies); and a profile is an NE if no player has an incentive to deviate from 
it (in our case, to change her strategy in a way that would cause the outcome of 
the new profile to satisfy her objective). 

Two approaches to rational synthesis have been studied. In cooperative ra- 
tional synthesis (CRS) [11], the desired output is an NE profile whose outcome 
satisfies the objective of the system. Thus, in CRS, we assume that we can sug- 
gest strategies to the environment players, and once they have no incentive to 
deviate from these strategies, they follow them. Then, in non-cooperative ratio- 
nal synthesis (NRS) [15], the desired output is a strategy for the system player 
such that the objective of the system is satisfied in the outcome of all NE profiles 
that include this strategy. Thus, in NRS, the environment players are rational, 
but we cannot suggest them a strategy. 
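The difference between the two approaches can be illustrated on a finite one-shot game, which is a strong simplification of the infinite-duration LTL games studied here. In the Python sketch below, every name is hypothetical, objectives are Boolean predicates on action profiles, and player 0 is the system. As an assumption of this sketch, the NRS check lets only the environment players deviate, since the system's strategy is the one being fixed.

```python
from itertools import product

def is_nash(profile, actions, objectives, deviators):
    """No player in `deviators` can flip an unmet objective by deviating alone."""
    for p in deviators:
        if objectives[p](profile):
            continue  # a satisfied player has no incentive to deviate
        for a in actions[p]:
            deviated = profile[:p] + (a,) + profile[p + 1:]
            if objectives[p](deviated):
                return False
    return True

def crs(actions, objectives):
    """Cooperative: some NE profile whose outcome satisfies the system."""
    players = range(len(actions))
    for profile in product(*actions):
        if objectives[0](profile) and is_nash(profile, actions, objectives, players):
            return profile
    return None

def nrs(actions, objectives):
    """Non-cooperative: a system action that wins in every environment NE."""
    env = range(1, len(actions))
    for a0 in actions[0]:
        stable = [p for p in product(*actions)
                  if p[0] == a0 and is_nash(p, actions, objectives, env)]
        if all(objectives[0](p) for p in stable):
            return a0
    return None

ACTIONS = [(0, 1), (0, 1)]
# Game 1: the system wants to match the environment; the environment wants 1.
GAME1 = [lambda p: p[0] == p[1], lambda p: p[1] == 1]
# Game 2: same system objective; the environment is always satisfied.
GAME2 = [lambda p: p[0] == p[1], lambda p: True]

print(crs(ACTIONS, GAME1), nrs(ACTIONS, GAME1))
print(crs(ACTIONS, GAME2), nrs(ACTIONS, GAME2))
```

In the second game a cooperative solution exists because a matching profile can be suggested to the environment, while no system action wins against every stable environment behavior, which is the gap between CRS and NRS.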


The cooperative and non-cooperative approaches correspond to different set- 
tings in reality, having to do both with the technical ability to communicate a 
strategy to the environment players, say due to different architectures, as well 
as the willingness of the environment players to follow a suggested strategy. As 
shown in [1], the two approaches are related to the two stability-inefficiency mea- 
sures of price of stability [3] and price of anarchy [16,23]. Additional related work 
includes rational verification [27,12], where we check that a given system satisfies 
its specification when interacting with a rational environment, and extensions of 
rational synthesis to richer settings (multi-valued, partial visibility, and more) 
[4,13,18]. 

The complexity of rational synthesis was first studied for the case the in- 
put to the problem is the objectives of the players, given by LTL formulas. In 
this setting, CRS is in 2EXPTIME [11], whereas the best known upper bound 
for NRS until recently was 3EXPTIME [15] (the paper specifies a 2EXPTIME 
upper bound, but a careful analysis of the algorithm reveals that it is actu- 
ally in 3EXPTIME), improved to 2EXPTIME for turn-based games with two 
players [18]. The complexity analysis above suggests that rational synthesis is 
not harder than traditional synthesis. One may wonder whether this has to do 
with the doubly-exponential translation of LTL to deterministic automata, which 
dominates the complexity. To answer this question, [9] studies the complexity 
of rational synthesis where the objectives of the players are given by $\omega$-regular
winning conditions in a game graph (e.g., reachability, Büchi, and parity). The
analysis in [9] also distinguishes between the case the number of players is fixed
and the case it is not. As shown there, in most cases the complexity of the
rational variant coincides with the complexity of the zero-sum game. In some cases,
however, it does not. For example, while the problem of deciding Rabin games
is NP-complete [10], the best algorithm for solving CRS with Rabin objectives
is in $P^{NP}$, going up to PSPACE-complete in NRS, and going higher when the
number of players is not fixed [9].

In this work, we complete the study of the complexity of LTL rational syn- 
thesis. Our contribution is threefold. First, we tighten the known upper bound 
for NRS for settings with three or more players and for concurrent games, which 
were left open in [9,18]. Second, our complexity analysis is parametric, and we de- 
scribe tight upper and lower bounds in each of the problem parameters: the game 
graph, the objectives of the system players, and the objectives of the environment 
players. Third, we generalize the definition of rational synthesis, combining the 
cooperative and non-cooperative approaches, and extend our complexity analysis 
to the general definition. Below we elaborate on each of the contributions. 

Let us start with the generalization of the problem. In our general definition, 
we may suggest a strategy only to a subset of the environment players. Thus, 
we distinguish between three types of players: controllable, cooperative uncon- 
trollable, and non-cooperative uncontrollable. Then, in the (general) rational- 
synthesis (RS) problem, we are given a labeled graph and LTL formulas that 
specify the objectives of the players, and we seek strategies for the controllable 
and the cooperative-uncontrollable players such that the objectives of the con- 
trollable players are satisfied in the outcome of every NE profile that extends 
these strategies. Note that CRS and NRS can be viewed as special cases of RS 
where the uncontrollable players are all cooperative or all non-cooperative. 

In the tight-complexity front, our algorithms reduce rational synthesis to the 
nonemptiness problem of tree automata. The automata accept certified strategy 
trees: trees that are labeled by both a strategy for the controllable player and
information about uncontrollable players that deviate and the strategies to which 
they deviate. The most technically-challenging algorithm we describe is for NRS 
in the concurrent setting. While in the turn-based setting, we need a single player 
that deviates in order to justify a path in which the objective of the controllable 
player is not satisfied, in the concurrent setting, where the players choose actions 
simultaneously and independently, we need to consider sets of uncontrollable 
players. This makes the certificate much more complex. In particular, it involves 
labels from an exponential alphabet, which introduces an additional challenge, 
namely a need to decompose labels along branches in the tree. Also, while in the 
turn-based setting, an NE always exists, concurrent games with three or more 
players need not have an NE [7], and so a certified strategy tree should also 
certify the existence of an NE. 

Finally, in the parameterized-complexity front, the fact our algorithms use 
tree automata (rather than a translation to Strategy Logic [6], which has been the 
case in [11,15]), enables us to analyze the complexity in each of the parameters of 
the problem: the game graph $G$, the objective $\psi_1$ of the controllable player, and
the objectives $\psi_2, \dots, \psi_k$ of the uncontrollable players. For CRS, [18] studies the
parameterized complexity in turn-based games with two players. The algorithm
there is based on a distinction between the case the uncontrollable player satisfies 
her objective and the case she does not. Generalizing this to an arbitrary number 
of players, we parameterize solutions with the set of the uncontrollable players 
whose objectives are satisfied, and give a uniform solution to all cases. This also 
enables us to seek solutions that favor some or all uncontrollable players. 

We show that the complexity of CRS is polynomial in $|G|$, doubly-exponential
in $|\psi_2|, \dots, |\psi_k|$, and only exponential in $|\psi_1|$. Thus, in terms of the system
specification, CRS is in fact easier than traditional synthesis! Once we move to
NRS or RS, the complexity becomes doubly exponential in all objectives. We
describe tight lower bounds for the different parameters, and we show that they
are valid already for the case $k = 2$ and the game is turn based. Specifically,
we prove that CRS is EXPTIME-hard even when $G$ and $\psi_2$ are fixed, and is
2EXPTIME-hard even when $G$ and $\psi_1$ are fixed. Similarly, NRS is 2EXPTIME-hard
even when only one of $\psi_1$ and $\psi_2$ is not fixed. In order to see the technical
challenge in our lower-bound proofs, consider the current 2EXPTIME lower-bound
proof for CRS, where synthesis of an objective $\psi$ for the system is reduced
to CRS with objectives $\psi$ for the system and $\lnot\psi$ for the environment. The
reduction crucially depends on both objectives not being fixed, and just changing
either of them to True or False does not do the trick. In order to get
2EXPTIME-hardness in $|\psi_2|$, we need to cleverly manipulate both $G$ and $\psi_1$.

Together, our results complete the complexity picture for a generalized def- 
inition of rational synthesis, for both turn-based and concurrent systems, with 
any number of components, and with the exact dependencies in each of the 
parameters of the problem. 

Due to the lack of space, some proofs are omitted and can be found in the 
full version, in the authors’ URLs. 


2 Preliminaries 


2.1 LTL, trees, and automata 


The logic LTL is used for specifying on-going behaviors of reactive systems [24]. 
Formulas of LTL are constructed from a set AP of atomic propositions using 
the usual Boolean operators and the temporal operators G (“always”), F
(“eventually”), X (“next time”), and U (“until”). The semantics of LTL is
defined with respect to infinite computations in (2^AP)^ω. We are going to use LTL
for specifying the objectives of the system and the components composing the 
environment. 

² The study in [18] considers perspective games [19], which add the challenge of
partial visibility on top of rational synthesis, but the results there imply the
desired bounds for the case of full visibility.

Given a set D of directions, a D-tree is a set T ⊆ D* such that if x·d ∈ T,
where x ∈ D* and d ∈ D, then also x ∈ T. The elements of T are called nodes,
and the empty word ε is the root of T. For every x ∈ T, the nodes x·d, for
d ∈ D, are the successors of x, and the direction of node x·d is d. A path h in
a tree T is a set h ⊆ T such that ε ∈ h and for every x ∈ h, either x is a leaf or
there exists a unique d ∈ D such that x·d ∈ h. We sometimes refer to paths in
T as words in D* or D^ω. For a finite path h ⊆ D* and a finite or infinite path
h' ⊆ D*, we use h ⪯ h' to indicate that h is a prefix of h', thus h ⊆ h'. Given an
alphabet Σ, a Σ-labeled D-tree is a pair ⟨T, τ⟩ where T is a tree and τ : T → Σ
maps each node of T to a letter in Σ.
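
As a small illustration of the tree definitions above, the following Python sketch
represents nodes as tuples of directions and checks prefix-closure on a finite
fragment. The names is_tree and successors, and the example tree, are hypothetical
and introduced only for this example.

```python
from typing import Hashable, Iterable, Set, Tuple

Node = Tuple[Hashable, ...]  # a finite word over the set D of directions


def is_tree(nodes: Iterable[Node]) -> bool:
    """Check prefix-closure: whenever x.d is in T, x is in T as well."""
    t: Set[Node] = set(nodes)
    return all(x[:-1] in t for x in t if x)


def successors(t: Set[Node], x: Node) -> Set[Node]:
    """Return the nodes x.d of T, for d in D."""
    return {y for y in t if len(y) == len(x) + 1 and y[:len(x)] == x}


# A finite {0,1}-tree: the root (), its two successors, and one grandchild.
T = {(), (0,), (1,), (1, 0)}
# A labeling of T over the alphabet {"a", "b", "c"}, stored as a dictionary.
labels = {(): "a", (0,): "b", (1,): "a", (1, 0): "c"}
```

The trees used later in the paper are infinite (for example, V*), so a set of
tuples can only represent a finite fragment of them.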

Our algorithms use automata on infinite words and trees. We are going to use
nondeterministic and universal automata, yet define below alternating automata,
which subsume both classes. For a set X, let B^+(X) be the set of positive Boolean
formulas over X (i.e., Boolean formulas built from elements in X using ∧ and
∨), where we also allow the formulas true and false. For a set Y ⊆ X and a
formula θ ∈ B^+(X), we say that Y satisfies θ iff assigning true to elements in
Y and assigning false to elements in X \ Y makes θ true. An alternating tree
automaton is A = ⟨Σ, D, Q, q_in, δ, α⟩, where Σ is the input alphabet, D is a set
of directions, Q is a finite set of states, δ : Q × Σ → B^+(D × Q) is a transition
function, q_in ∈ Q is an initial state, and α ⊆ Q specifies a Büchi or a co-Büchi
acceptance condition. For a state q ∈ Q, we use A^q to denote the automaton
obtained from A by setting the initial state to be q. The size of A, denoted |A|,
is the sum of lengths of formulas that appear in δ.

The alternating automaton A runs on Σ-labeled D-trees. A run of A over
a Σ-labeled D-tree ⟨T, τ⟩ is a (T × Q)-labeled ℕ-tree ⟨T_r, r⟩. Each node of T_r
corresponds to a node of T. A node in T_r, labeled by (x, q), describes a copy of
the automaton that reads the node x of T and visits the state q. Note that many
nodes of T_r can correspond to the same node of T. The labels of a node and its
successors have to satisfy the transition function. Formally, ⟨T_r, r⟩ satisfies the
following:


1. ε ∈ T_r and r(ε) = (ε, q_in).

2. Let y ∈ T_r with r(y) = (x, q) and δ(q, τ(x)) = θ. Then there is a (possibly
empty) set S = {(c_0, q_0), (c_1, q_1), ..., (c_{n−1}, q_{n−1})} ⊆ D × Q, such that S
satisfies θ, and for all 0 ≤ i ≤ n−1, we have y·i ∈ T_r and r(y·i) = (x·c_i, q_i).


For example, if ⟨T, τ⟩ is a {0, 1}-tree with τ(ε) = a and δ(q_in, a) = ((0, q_1) ∨
(0, q_2)) ∧ ((0, q_3) ∨ (1, q_2)), then, at level 1, the run ⟨T_r, r⟩ includes a node
labeled (0, q_1) or a node labeled (0, q_2), and includes a node labeled (0, q_3) or a
node labeled (1, q_2). Note that if, for some y, the transition function δ has the
value true, then y need not have successors. Also, δ can never have the value
false in a run.
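
The satisfaction relation for positive Boolean formulas can be sketched in a few
lines of Python. The encoding below (nested tuples tagged with "and" and "or",
and the function name satisfies) is an assumption made for this illustration; it
evaluates the example transition from the paragraph above.

```python
from typing import FrozenSet, Tuple

Atom = Tuple[int, str]  # an element (direction, state) of D x Q


def satisfies(s: FrozenSet[Atom], theta) -> bool:
    """Decide whether the set s of atoms satisfies the positive formula theta.

    theta is True, False, an atom, or a triple (op, left, right)
    with op in {"and", "or"}.
    """
    if isinstance(theta, bool):
        return theta
    if theta[0] == "and":
        return satisfies(s, theta[1]) and satisfies(s, theta[2])
    if theta[0] == "or":
        return satisfies(s, theta[1]) or satisfies(s, theta[2])
    return theta in s  # theta is an atom


# The example transition ((0,q1) or (0,q2)) and ((0,q3) or (1,q2)).
theta = ("and",
         ("or", (0, "q1"), (0, "q2")),
         ("or", (0, "q3"), (1, "q2")))

good = frozenset({(0, "q1"), (1, "q2")})  # one atom from each conjunct
bad = frozenset({(0, "q2")})              # nothing for the second conjunct
```

Here satisfies(good, theta) holds and satisfies(bad, theta) does not, matching the
requirement that level 1 of the run contains a node for each conjunct.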

A run ⟨T_r, r⟩ is accepting if all its infinite paths satisfy the acceptance
condition. Given a run ⟨T_r, r⟩ and an infinite path ρ ⊆ T_r, let inf(ρ) ⊆ Q be such that
q ∈ inf(ρ) if and only if there are infinitely many y ∈ ρ for which r(y) ∈ T × {q}.
That is, inf(ρ) contains exactly all the states that appear infinitely often in ρ. A
path ρ satisfies a Büchi acceptance condition α iff inf(ρ) ∩ α ≠ ∅, and satisfies a
co-Büchi acceptance condition α iff inf(ρ) ∩ α = ∅. We also consider the parity
acceptance condition, where α : Q → {0, 1, ..., k} maps each state to a color in
{0, 1, ..., k}, and a path ρ satisfies α if the minimal color visited infinitely often
is even, thus min{i : inf(ρ) ∩ α^{−1}(i) ≠ ∅} is even. An automaton accepts a tree
iff there exists a run that accepts it. We denote by L(A) the set of all Σ-labeled
trees that A accepts. The size of A, denoted |A|, is the sum of lengths of the
description of its transition function.
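
The three acceptance conditions can be compared on a single path with a short
Python sketch. It assumes, for illustration only, that the path is ultimately
periodic, so that the states visited infinitely often are exactly those on its
(nonempty) cycle; the function names are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Set


def inf_states(cycle: Sequence[str]) -> Set[str]:
    """States visited infinitely often on a path of the form prefix . cycle^omega."""
    return set(cycle)


def satisfies_buchi(cycle: Sequence[str], alpha: Iterable[str]) -> bool:
    """Buchi: some state of alpha is visited infinitely often."""
    return bool(inf_states(cycle) & set(alpha))


def satisfies_co_buchi(cycle: Sequence[str], alpha: Iterable[str]) -> bool:
    """co-Buchi: no state of alpha is visited infinitely often."""
    return not (inf_states(cycle) & set(alpha))


def satisfies_parity(cycle: Sequence[str], color: Dict[str, int]) -> bool:
    """Parity: the minimal color visited infinitely often is even."""
    return min(color[q] for q in inf_states(cycle)) % 2 == 0


cycle = ["q1", "q2"]  # the path eventually alternates between q1 and q2
```

On every path, the Büchi and co-Büchi conditions for the same set α are
complementary, which the sketch makes explicit.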

The alternating automaton A is nondeterministic if for all the formulas that
appear in δ, if (c_1, q_1) and (c_2, q_2) are conjunctively related, then c_1 ≠ c_2 (i.e.,
if the transition is rewritten in disjunctive normal form, there is at most one
element of {c} × Q, for each c ∈ D, in each disjunct). Note that then, the run
tree T_r is equal to T, and the Σ-labels in τ are replaced by Q-labels in r. The
automaton A is universal if all the formulas that appear in δ are conjunctions
of atoms in D × Q. Note that then, there is only one run tree of A on ⟨T, τ⟩, yet
each node x ∈ T may have several nodes y ∈ T_r such that r(y) = (x, q) for some
q ∈ Q. Finally, A is deterministic if it is both nondeterministic and universal.
The automaton A is a word automaton if |D| = 1. Then, we can omit D from
the specification of the automaton and denote the transition function of A as
δ : Q × Σ → B^+(Q). If the word automaton is nondeterministic or universal,
then δ : Q × Σ → 2^Q.

We denote different types of automata by three-letter acronyms in {D, N, U} ×
{F, B, C, P} × {W, T}, where the first letter describes the branching mode of the
automaton (deterministic, nondeterministic, or universal), the second letter
describes the acceptance condition (finite, Büchi, co-Büchi, or parity), and the third
letter describes the object over which the automaton runs (words or trees). For
example, UCT stands for a universal co-Büchi tree automaton.


2.2 Concurrent multiplayer games 


For k ≥ 1, let [k] = {1, ..., k}. A k-player game graph is a tuple G = ⟨AP, V, v_0,
{A_i}_{i∈[k]}, {κ_i}_{i∈[k]}, δ, τ⟩, where AP is a set of atomic propositions, V is a set of
vertices, v_0 ∈ V is an initial vertex, and for i ∈ [k], the set A_i is a set of actions
of Player i, and κ_i : V → 2^{A_i} specifies the set of actions that Player i can take
at each vertex.

A move in G is a tuple (a_1, ..., a_k) ∈ A_1 × ··· × A_k, describing possible choices
of actions for all k players. A move (a_1, ..., a_k) is possible for vertex v ∈ V if
a_i ∈ κ_i(v) for all i ∈ [k]. Then, the transition function δ : V × A_1 × ··· × A_k → V
is a deterministic function that maps each vertex and possible move for it to a
successor vertex. Finally, the function τ : V → 2^AP maps each vertex to the set
of atomic propositions that hold in it.

A game is a tuple 𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩, where G is a k-player game graph, and
ψ_i, for i ∈ [k], is an LTL formula over AP, describing the objective of Player i.
At the beginning of a play in the game, a token is placed on v_0. Then, at each
round, the players choose actions simultaneously and independently of the other
players, and the induced move determines the successor vertex. Repeating this,
the players generate a play ρ = v_0, v_1, ... in G, which induces the computation
τ(ρ) = τ(v_0), τ(v_1), ... ∈ (2^AP)^ω. For every i ∈ [k], Player i aims for a play
whose computation satisfies ψ_i. For an LTL formula ψ, let L(ψ) ⊆ (2^AP)^ω be
the set of computations that satisfy ψ.

A strategy for Player i is a function f_i : V^+ → A_i that maps histories of
the game to an action suggested to Player i. The suggestion has to be consistent
with κ_i. Thus, for every v_0 v_1 ··· v_j ∈ V^+, we have that f_i(v_0 v_1 ··· v_j) ∈ κ_i(v_j). A
profile is a tuple π = ⟨f_1, ..., f_k⟩ of strategies, one for each player. The outcome
of a profile π = ⟨f_1, ..., f_k⟩ is the play obtained when the players follow their
strategies. Formally, Outcome(π) = v_0, v_1, ... is such that for all j ≥ 0, we
have that v_{j+1} = δ(v_j, (f_1(v_0 ··· v_j), ..., f_k(v_0 ··· v_j))). For a subset S ⊆ [k] of
players, an S-profile is a set of strategies, one for each player in S. We say that
a profile π extends an S-profile π' if the players in S use in π their strategies in
π'.
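
To make the definition of Outcome(π) concrete, here is a minimal Python sketch
that unrolls a finite prefix of the outcome of a profile. The two-vertex graph,
the strategies f1 and f2, and the helper name outcome_prefix are all hypothetical
and serve only as an illustration; strategies are modeled as functions from
histories (tuples of vertices) to actions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vertex = str
Move = Tuple[str, ...]
Strategy = Callable[[Tuple[Vertex, ...]], str]  # a history in V^+ -> an action


def outcome_prefix(v0: Vertex,
                   delta: Dict[Tuple[Vertex, Move], Vertex],
                   profile: Sequence[Strategy],
                   rounds: int) -> List[Vertex]:
    """Return v_0, ..., v_rounds, where v_{j+1} = delta(v_j, move after v_0..v_j)."""
    history: Tuple[Vertex, ...] = (v0,)
    for _ in range(rounds):
        move = tuple(f(history) for f in profile)  # simultaneous choices
        history += (delta[(history[-1], move)],)
    return list(history)


# Hypothetical 2-player graph: from "u" the token moves to "w" iff the two
# actions agree; "w" is a sink.
ACTIONS = ("a", "b")
delta = {}
for x in ACTIONS:
    for y in ACTIONS:
        delta[("u", (x, y))] = "w" if x == y else "u"
        delta[("w", (x, y))] = "w"

tau = {"u": frozenset(), "w": frozenset({"p"})}  # labeling of the vertices

f1: Strategy = lambda h: "a"                          # Player 1 always plays a
f2: Strategy = lambda h: "b" if len(h) == 1 else "a"  # Player 2 plays b once

play = outcome_prefix("u", delta, (f1, f2), 3)
computation = [tau[v] for v in play]  # the induced prefix of tau(play)
```

Deciding whether the full infinite computation satisfies an LTL objective requires
automata-based reasoning, as in the rest of the paper; the sketch only unrolls the
play.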


Consider a profile π. The set of winners in π, denoted Win(π), is the set
of players whose objectives are satisfied in Outcome(π). Formally, i ∈ Win(π)
iff τ(Outcome(π)) ∈ L(ψ_i). The set of losers in π, denoted Lose(π), is then
[k] \ Win(π), namely the set of players whose objectives are not satisfied in
Outcome(π).


A game 𝒢 is zero-sum if the objectives of the players form a partition of all
possible behaviors. That is, for every i ≠ j ∈ [k], we have that L(ψ_i) ∩ L(ψ_j) = ∅,
and ⋃_{i∈[k]} L(ψ_i) = (2^AP)^ω. Accordingly, for every profile π in a zero-sum game,
we have that |Win(π)| = 1 and |Lose(π)| = k − 1. We then say that Player i wins
𝒢 if she has a winning strategy, namely a strategy that guarantees the satisfaction
of ψ_i no matter how the other players proceed. Formally, f_i is a winning strategy
if for every profile π with f_i, we have that Win(π) = {i}.


Games may be non zero-sum, thus the objectives of the players may overlap.
In such games, we are interested in stable profiles. In particular, a profile
π = ⟨f_1, ..., f_k⟩ is a Nash Equilibrium (NE, for short) [21] if, intuitively, no
(single) player can benefit from unilaterally changing her strategy. In our
setting, benefiting amounts to moving from the set of losers to the set of
winners. Formally, for i ∈ [k] and a strategy f'_i for Player i, let π[i ← f'_i] =
⟨f_1, ..., f_{i−1}, f'_i, f_{i+1}, ..., f_k⟩ be the profile obtained from π by changing the
strategy of Player i to f'_i. We say that π is an NE if for every i ∈ [k], if i ∈ Lose(π),
then for every strategy f'_i, we have that i ∈ Lose(π[i ← f'_i]). Thus, π is an NE
if no player has an incentive to deviate from π. For a subset W ⊆ [k] of players,
we say that π is a W-NE if π is an NE with W = Win(π).
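
The NE condition can be sketched in Python under a strong simplifying assumption
that is not part of the paper: each strategy is collapsed to a single action, as
in a game whose winners are determined by the first move. Players are indexed
from 0 rather than 1, and the names is_nash_equilibrium and winners are
hypothetical.

```python
from typing import Callable, Sequence, Set, Tuple

Profile = Tuple[str, ...]


def is_nash_equilibrium(profile: Profile,
                        actions: Sequence[Sequence[str]],
                        winners: Callable[[Profile], Set[int]]) -> bool:
    """True iff no losing player can join the winners by a unilateral deviation."""
    losers = set(range(len(profile))) - winners(profile)
    for i in losers:
        for alt in actions[i]:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if i in winners(deviated):
                return False  # Player i has a beneficial deviation
    return True


# Hypothetical non zero-sum example: Player 0 wins iff the actions agree,
# and Player 1 wins iff both players choose "b".
def winners(p: Profile) -> Set[int]:
    w = set()
    if p[0] == p[1]:
        w.add(0)
    if p == ("b", "b"):
        w.add(1)
    return w


ACTIONS = [("a", "b"), ("a", "b")]
```

In this example, ("b", "b") is a {0, 1}-NE and ("a", "a") is a {0}-NE, while
("a", "b") is not an NE, since Player 0 can deviate to "b" and win.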


The game 𝒢 is turn-based if the transition function of its graph G is such that
for every vertex v ∈ V, there is a single player that “owns” v and determines
the successor vertex whenever the play is in v. Formally, for every v ∈ V, there
is i ∈ [k] such that for all moves (a_1, ..., a_k) and (a'_1, ..., a'_k) that are possible
for v, if a_i = a'_i, then δ(v, (a_1, ..., a_k)) = δ(v, (a'_1, ..., a'_k)). Accordingly, we
describe the game graph of a turn-based game as G = ⟨AP, {V_i}_{i∈[k]}, v_0, E, τ⟩,
where V_1, ..., V_k is a partition of V to the sets of vertices owned by the different
players, and E ⊆ V × V is the transition relation, modeling the fact that the set
of actions of Player i in a vertex v she owns is the set of v’s successors.
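
The formal turn-based condition can be checked directly on a finite concurrent
game graph. The Python sketch below is an illustration under assumed names
(owners, is_turn_based) and an assumed dictionary encoding of δ; players are
indexed from 0.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Move = Tuple[str, ...]


def owners(v: str, k: int,
           possible_moves: Dict[str, Sequence[Move]],
           delta: Dict[Tuple[str, Move], str]) -> List[int]:
    """Players i such that the successor of v is determined by a_i alone."""
    result = []
    for i in range(k):
        succ_by_action: Dict[str, str] = {}
        determined = True
        for move in possible_moves[v]:
            succ = delta[(v, move)]
            # Two moves that agree on a_i must lead to the same successor.
            if succ_by_action.setdefault(move[i], succ) != succ:
                determined = False
                break
        if determined:
            result.append(i)
    return result


def is_turn_based(vertices: Sequence[str], k: int,
                  possible_moves: Dict[str, Sequence[Move]],
                  delta: Dict[Tuple[str, Move], str]) -> bool:
    """True iff every vertex has at least one player that owns it."""
    return all(owners(v, k, possible_moves, delta) for v in vertices)


# Hypothetical 2-player graph: "u" is controlled by Player 0 alone, whereas
# the successor of "w" depends on the actions of both players.
ACTIONS = ("a", "b")
moves = list(product(ACTIONS, repeat=2))
possible_moves = {"u": moves, "w": moves}
delta = {}
for x, y in moves:
    delta[("u", (x, y))] = "u" if x == "a" else "w"
    delta[("w", (x, y))] = "u" if x == y else "w"
```

A vertex whose successor does not depend on the move at all is owned, in this
sense, by every player; a partition V_1, ..., V_k then picks one owner for it.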
3 Rational Synthesis


Consider a k-player game 𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩. We distinguish between three types
of players: A player is controllable if she is guaranteed to follow a strategy as- 
signed to her. Otherwise, she is uncontrollable. The uncontrollable players are 
rational — they would not deviate from a profile unless they have a beneficial 
deviation from it. We distinguish between cooperative uncontrollable players, to 
which we can suggest a strategy (which they would follow unless they have a 
beneficial deviation), and non-cooperative uncontrollable players, to which we 
cannot suggest a strategy. The distinction between the cooperative and non- 
cooperative uncontrollable players may be induced by the architecture or the 
nature of the players. We denote by C, CU, and NU the partition of [k]
into the classes of controllable, uncontrollable cooperative, and uncontrollable 
non-cooperative players, respectively. 

In rational synthesis, we seek a strategy for each of the players in C with 
which their objectives are guaranteed to be satisfied, assuming rationality of the 
other players. As we have the best interest of the players in C in mind, we assume
that C ≠ ∅. We say that a profile π = ⟨f_1, ..., f_k⟩ is a C-fixed NE if no player
in CU ∪ NU has a beneficial deviation. Formally, we have the following.


Definition 1. [Rational Synthesis] Consider a k-player game 𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩.
The problem of rational synthesis (RS) is to return a (C ∪ CU)-profile π' such that
there is a C-fixed NE that extends π', and for every C-fixed NE π that extends
π', we have that C ⊆ Win(π).


Two special cases of rational synthesis have been studied in the literature. 
The first is cooperative rational synthesis, where all uncontrollable players are 
cooperative [11]. The second is non-cooperative rational synthesis, where all un- 
controllable players are non-cooperative [15]. 


Definition 2. [Cooperative Rational Synthesis] Consider a k-player game
𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩ with NU = ∅. The problem of cooperative rational synthesis
(CRS) is to return a C-fixed NE π such that C ⊆ Win(π).


Definition 3. [Non-Cooperative Rational Synthesis] Consider a k-player
game 𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩ with CU = ∅. The problem of non-cooperative rational
synthesis (NRS) is to return a C-profile π' such that there is a C-fixed NE that
extends π', and for every C-fixed NE π that extends π', we have that C ⊆ Win(π).
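
The difference between CRS and NRS can be seen by brute force on a tiny example.
The Python sketch below again assumes, purely for illustration, that strategies
are collapsed to single actions, so profiles can be enumerated; players are
indexed from 0, and the names is_fixed_ne, crs_solutions, and nrs_solutions are
hypothetical. It is not the automata-based algorithm of the paper.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, List, Sequence, Set, Tuple

Profile = Tuple[str, ...]
Winners = Callable[[Profile], Set[int]]


def is_fixed_ne(p: Profile, actions: Sequence[Sequence[str]],
                winners: Winners, controllable: FrozenSet[int]) -> bool:
    """C-fixed NE: no losing player outside C has a beneficial deviation."""
    won = winners(p)
    for i in range(len(p)):
        if i in controllable or i in won:
            continue
        if any(i in winners(p[:i] + (alt,) + p[i + 1:]) for alt in actions[i]):
            return False
    return True


def crs_solutions(actions, winners: Winners,
                  controllable: FrozenSet[int]) -> List[Profile]:
    """All C-fixed NEs in which every controllable player wins."""
    return [p for p in product(*actions)
            if is_fixed_ne(p, actions, winners, controllable)
            and controllable <= winners(p)]


def nrs_solutions(actions, winners: Winners,
                  controllable: FrozenSet[int]) -> List[Dict[int, str]]:
    """C-profiles that some C-fixed NE extends, and whose C-fixed NE
    extensions are all won by every controllable player."""
    c = sorted(controllable)
    solutions = []
    for choice in product(*(actions[i] for i in c)):
        fixed = dict(zip(c, choice))
        nes = [p for p in product(*actions)
               if all(p[i] == a for i, a in fixed.items())
               and is_fixed_ne(p, actions, winners, controllable)]
        if nes and all(controllable <= winners(p) for p in nes):
            solutions.append(fixed)
    return solutions


# Hypothetical example: Player 0 (controllable) wins iff the actions agree,
# and Player 1 always wins, so she never has a reason to deviate.
def winners(p: Profile) -> Set[int]:
    return ({0} if p[0] == p[1] else set()) | {1}


ACTIONS = [("a", "b"), ("a", "b")]
C = frozenset({0})
```

Here the cooperative problem has the solutions ("a", "a") and ("b", "b"), whereas
the non-cooperative problem has none: whichever action Player 0 fixes, the profile
in which Player 1 chooses the other action is also a C-fixed NE, and Player 0
loses in it.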


Remark 1. The original rational synthesis problem does not include a game 
graph [11]. Instead, the set AP over which the objectives are defined is par- 
titioned among the players, and at each round of the game, each player chooses 
an assignment to the subset of AP she controls. It is easy to see that this setting
is a special case of our setting, taking the graph to have vertices in 2^AP.


Remark 2. In previous work, the definition of NRS does not require the existence
of a C-fixed NE that extends π' [15,18]. In some settings (in particular,
turn-based games), the existence of such an NE is guaranteed. In others (in particular,
concurrent games) there need not be an NE in games with three or more players
[7]. Note, however, that even with the requirement that a C-fixed NE that extends
π' exists, there is no guarantee that best response dynamics from π' would lead
to such a C-fixed NE.


As in traditional synthesis, one can also define the corresponding decision 
problems, of rational realizability, where we only need to decide whether the 
desired strategies exist. In order to avoid additional notations, we sometimes 
refer to RS, CRS, and NRS also as decision problems. 

For a set W ⊆ [k], we say that a solution to the rational synthesis problem is
a W-solution iff it is a solution that guarantees the winning of exactly the players
in W. In particular, a (C ∪ CU)-profile π' is a W-RS solution if it is an RS solution
such that for every C-fixed NE π that extends π', we have that W = Win(π); a
profile π is a W-CRS solution if π is a CRS solution such that W = Win(π); and
a C-profile π' is a W-NRS solution if π' is an NRS solution such that for every
C-fixed NE π that extends π', we have that W = Win(π).

It is easy to see that since the players in C are controllable, we can treat them 
as a single player with an objective that is the conjunction of the objectives of 
the players in C. Accordingly, in the sequel we assume that C = {1}. 


Remark 3. It is easy to add to the setting uncontrollable hostile players, namely 
players that, as in traditional synthesis, do not have an objective. Indeed, an 
uncontrollable hostile player is equivalent to an uncontrollable (cooperative or
non-cooperative) player with objective ¬ψ_1.


4 The Complexity of Cooperative Rational Synthesis 


In this section we study the complexity of CRS. Consider a k-player concurrent
game 𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩. A strategy for Player i can be viewed as an A_i-labeled
V-tree, and a profile can be viewed as an (A_1 × ··· × A_k)-labeled V-tree. Formally,
if π = ⟨f_1, ..., f_k⟩ then for every node h ∈ V* in the profile tree ⟨V*, π⟩, we have
π(h) = (f_1(h), ..., f_k(h)), where ⟨V*, f_i⟩ is the strategy tree that corresponds to
f_i. Note that Outcome(π) then corresponds to a path in ⟨V*, π⟩.
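
The view of a profile as a labeled tree can be sketched as follows. The Python
code computes the labels π(h) on a finite fragment of the profile tree; it
assumes, as an illustration, that strategies are functions on nonempty histories
(tuples of vertices) and are defined on all of them, not only on histories that
are consistent with the game graph. The helper name profile_tree_prefix and the
example strategies are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

History = Tuple[str, ...]
Strategy = Callable[[History], str]


def profile_tree_prefix(vertices: Sequence[str],
                        profile: Sequence[Strategy],
                        depth: int) -> Dict[History, Tuple[str, ...]]:
    """Labels pi(h) = (f_1(h), ..., f_k(h)) for all nonempty h with |h| <= depth."""
    labels: Dict[History, Tuple[str, ...]] = {}
    for n in range(1, depth + 1):
        for h in product(vertices, repeat=n):
            labels[h] = tuple(f(h) for f in profile)
    return labels


# Hypothetical strategies over the vertices "u" and "w".
f1: Strategy = lambda h: "a"                            # ignores the history
f2: Strategy = lambda h: "b" if h[-1] == "u" else "a"   # looks at the last vertex

labels = profile_tree_prefix(("u", "w"), (f1, f2), 2)
```

Each strategy tree is the projection of these labels on one component, and the
outcome of the profile is the single path of the tree that is consistent with the
transition function δ.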

Viewing profiles as labeled trees enables us to reduce CRS to the nonemptiness
of a tree automaton. Essentially, the automaton accepts all profile trees that
are solutions to the CRS problem. We define the automaton by decomposing the
solutions according to the set of players that win. Given a set W of players with
1 ∈ W, a profile π is a W-CRS solution iff π is a 1-fixed W-NE. Thus, iff exactly
the players in W win in Outcome(π), and for every i ∉ W, Player i loses in
Outcome(π[i ← f'_i]), for every strategy f'_i. In Theorem 1 below, we construct
automata that check these conditions.


Theorem 1. Consider a set of players W with 1 ∈ W. We can construct the
following tree automata over (A_1 × ··· × A_k)-labeled V-trees:


— An NBT N_W that accepts a profile tree ⟨V*, π⟩ iff Win(π) = W. The size of
N_W is polynomial in |G| and exponential in |ψ_1|, |ψ_2|, ..., |ψ_k|.
— For every i ∉ W, a UCT U_W^i that accepts a profile tree ⟨V*, π⟩ iff i ∈
Lose(π[i ← f'_i]), for every strategy f'_i. The size of U_W^i is polynomial in |G|
and exponential in |ψ_i|.


Proof. We start with the NBT N_W. Recall that we want N_W to accept a profile
tree ⟨V*, π⟩ iff exactly the players in W win in Outcome(π). Let ψ = ⋀_{i∈W} ψ_i ∧
⋀_{i∉W} (¬ψ_i), and let A = ⟨2^AP, Q, q_0, μ, α⟩ be an NBW of size exponential in |ψ|
that corresponds to ψ. The NBT N_W follows the outcome in the profile tree,
and checks if the players in W are exactly the winners of the profile. Formally,
N_W = ⟨A_1 × ··· × A_k, V, V × Q, ⟨v_0, q_0⟩, η, V × α⟩, where for every ⟨v, q⟩ ∈
V × Q and (a_1, ..., a_k) ∈ A_1 × ··· × A_k, we have that

η(⟨v, q⟩, (a_1, ..., a_k)) =
    ⋁_{q' ∈ μ(q, τ(v))} (δ(v, (a_1, ..., a_k)), ⟨δ(v, (a_1, ..., a_k)), q'⟩).

We continue to the UCT U_W^i. Recall that we want U_W^i to accept a profile
tree ⟨V*, π⟩ iff Player i loses in Outcome(π[i ← f'_i]), for every strategy
f'_i. Let U_i = ⟨2^AP, Q_i, q_i^0, μ_i, α_i⟩ and U_{¬i} = ⟨2^AP, S_i, s_i^0, ρ_i, β_i⟩ be the UCWs
corresponding to ψ_i and ¬ψ_i, respectively. The UCT U_W^i follows every possible
deviation for Player i, and checks that indeed she always loses. Formally,
U_W^i = ⟨A_1 × ··· × A_k, V, V × S_i, ⟨v_0, s_i^0⟩, η, V × β_i⟩, where for every ⟨v, s⟩ ∈
V × S_i and (a_1, ..., a_k) ∈ A_1 × ··· × A_k, we have that

η(⟨v, s⟩, (a_1, ..., a_k)) =
    ⋀_{a'_i ∈ κ_i(v)} ⋀_{s' ∈ ρ_i(s, τ(v))}
        (δ(v, (a_1, ..., a'_i, ..., a_k)), ⟨δ(v, (a_1, ..., a'_i, ..., a_k)), s'⟩).
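
The two transition functions in the proof can be read as small procedures. The
Python sketch below is an illustration under assumptions that are not taken from
the paper: δ, μ, and ρ_i are stored as dictionaries, τ(v) is a frozenset, players
are indexed from 0, and the word automata in the example are hypothetical. Each
function returns the set of atoms (direction, state) of the corresponding
disjunction or conjunction.

```python
from typing import Dict, FrozenSet, Sequence, Set, Tuple

Move = Tuple[str, ...]
Letter = FrozenSet[str]
Atom = Tuple[str, Tuple[str, str]]  # (direction, (vertex, word-automaton state))


def nbt_transition(v: str, q: str, move: Move,
                   delta: Dict[Tuple[str, Move], str],
                   tau: Dict[str, Letter],
                   mu: Dict[Tuple[str, Letter], Set[str]]) -> Set[Atom]:
    """Disjuncts of eta(<v,q>, move): follow the move, guess a successor of q."""
    succ = delta[(v, move)]
    return {(succ, (succ, q2)) for q2 in mu[(q, tau[v])]}


def uct_transition(v: str, s: str, move: Move, i: int,
                   kappa_i: Dict[str, Sequence[str]],
                   delta: Dict[Tuple[str, Move], str],
                   tau: Dict[str, Letter],
                   rho_i: Dict[Tuple[str, Letter], Set[str]]) -> Set[Atom]:
    """Conjuncts of eta(<v,s>, move): follow every deviation of Player i."""
    conjuncts: Set[Atom] = set()
    for alt in kappa_i[v]:
        deviated = move[:i] + (alt,) + move[i + 1:]
        succ = delta[(v, deviated)]
        conjuncts |= {(succ, (succ, s2)) for s2 in rho_i[(s, tau[v])]}
    return conjuncts


# Hypothetical 2-player graph: from "u" the token moves to "w" iff the two
# actions agree; "w" is a sink labeled by the atomic proposition p.
ACTIONS = ("a", "b")
delta = {}
for x in ACTIONS:
    for y in ACTIONS:
        delta[("u", (x, y))] = "w" if x == y else "u"
        delta[("w", (x, y))] = "w"
tau = {"u": frozenset(), "w": frozenset({"p"})}
kappa = {"u": ACTIONS, "w": ACTIONS}

# Hypothetical word-automaton transitions over the letters {} and {p}.
mu = {("q0", frozenset()): {"q0"}, ("q0", frozenset({"p"})): {"q1"},
      ("q1", frozenset()): {"q1"}, ("q1", frozenset({"p"})): {"q1"}}
rho = {("s0", frozenset()): {"s0"}, ("s0", frozenset({"p"})): {"s1"},
       ("s1", frozenset()): {"s1"}, ("s1", frozenset({"p"})): {"s1"}}
```

In both cases all atoms sent to a direction carry that direction as their
V-component, which is the determinism in the V-element that the complexity
analysis relies on.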


Theorem 2. Solving CRS can be done in time polynomial in |G|, exponential
in |ψ_1|, and doubly-exponential in |ψ_2|, ..., |ψ_k|. The problem is EXPTIME-hard
in |ψ_1| and 2EXPTIME-hard in each of |ψ_2|, ..., |ψ_k|.


Proof. We start with the upper bound. It is easy to see that for every set W ⊆ [k]
of players with 1 ∈ W, there is a W-CRS solution iff the intersection of the
automata constructed in Theorem 1 is nonempty. We construct an NBT A such
that L(A) ≠ ∅ iff L(N_W) ∩ ⋂_{i∉W} L(U_W^i) ≠ ∅, and |A| is polynomial in |G|,
exponential in |ψ_1|, and doubly-exponential in |ψ_2|, ..., |ψ_k|. Since nonemptiness of
NBTs can be checked in quadratic time [26], the upper bound follows. Moreover,
when L(A) ≠ ∅, the algorithm returns a witness to A’s nonemptiness, namely a
profile tree that is a solution to the CRS problem.

The construction of A involves two challenges. First, a naive analysis of the
blow-up involved in translating UCTs to NBTs is exponential in the state space
of the UCT. In our case, the state space of a UCT U_W^i is of the form V × S,
for some set S that is independent of G. Also, the V-component is updated
deterministically: all states sent to the same direction v of the tree agree on their
V-element. Consequently, the exponential blow-up is only in the S-component,
which depends only on |ψ_i|. Second, the transformation of UCTs to NBTs that
is described in [20] preserves nonemptiness, whereas here we need to preserve
nonemptiness of an intersection of automata. As detailed in [17], where we coped
with a similar challenge, this can be handled by parameterizing the construction
in [20] by a rank (essentially, a bound on the size of transducers that generate
trees in the language of the automaton) that corresponds to the size of the
intersection.

We continue to the lower bounds, and we show they are valid already in the
case k = 2. Proving an EXPTIME lower bound in |ψ_1|, we describe a reduction
from the membership problem for linear-space alternating Turing machines
(ATM), defined in the full version. That is, given an ATM M with space
complexity s : ℕ → ℕ and a word w, we construct a 2-player turn-based game
𝒢 = ⟨G, {ψ_1, ψ_2}⟩, such that G and ψ_2 are of a fixed size, ψ_1 is of size linear in
s(|w|), and there is a CRS solution in 𝒢 iff M accepts w.

Essentially, Player 1 and Player 2 generate a branch in the computation tree 
of M on w. Player 1 chooses the letters of the current configuration one by 
one, and chooses, at the end of each existential configuration, the successor 
configuration to which the branch continues. Player 2, on the other hand, only 
chooses successor configurations at the end of each universal configuration (see 
Fig. 1). The objective of Player 1 is to reach an accepting configuration, and the 
objective of Player 2 is to reach a rejecting configuration. 


Fig. 1. The game graph G. The circles are vertices controlled by Player 1, and the 
square is a vertex controlled by Player 2. 


We prove that 𝒢 has a {1}-NE that satisfies ψ_1 iff M accepts w. First, if M
accepts w, then the profile in which Player 1 follows a strategy that generates
the configurations in the accepting computation and chooses the appropriate
successors to existential configurations, is a {1}-NE that satisfies ψ_1. Also, if M
rejects w, then Player 2 can choose successors of universal configurations in a
way that leads to a rejecting configuration. Thus, there is no {1}-NE in 𝒢 that
satisfies ψ_1, as either Player 1 loses by not forming a valid branch, or Player 2
can deviate to a strategy where she wins and Player 1 loses. In the full version,
we give the details of the reduction.

Proving a 2EXPTIME lower bound in |ψ_2|, we use a reduction from deciding
2-player zero-sum games, which is 2EXPTIME-hard already for a game
with a game graph of a fixed size [2]. Given a 2-player zero-sum game 𝒢 = ⟨G, ψ⟩,
we construct a 2-player game ℋ = ⟨H, {ψ_1, ψ_2}⟩ such that the size of H is linear
in |G|, ψ_1 is of a fixed size, ψ_2 is of size linear in |ψ|, and there is a CRS solution
in ℋ iff Player 1 wins 𝒢. Essentially, the game graph H contains two copies of
G, and a new initial vertex in which Player 2 chooses between proceeding to the
first or the second copy.

Note that Player 1 has no influence in that decision. Then, the objective of 
Player 1 is for the play to be generated in the first copy, and the objective of 
Player 2 is for the play to be generated in the second copy and for the
computation to not satisfy ψ.


Remark 4. Note that our algorithm finds W-CRS solutions for all W ⊆ [k], and
so it is exponential in k. As shown in [9], rational synthesis is PSPACE in k 
already for rational synthesis with reachability objectives. 


5 The Complexity of Non-Cooperative Rational Synthesis 


In this section we study the complexity of NRS. We start with the turn-based 
setting, and then proceed to the concurrent setting.


5.1 Turn-based games 


As in the CRS case, we construct a tree automaton that accepts strategy trees 
that are NRS solutions. Here, however, the trees are labeled not only by a strat- 
egy for Player 1, but also by information that certifies that the suggested strategy 
is indeed a solution. Our construction follows the ideas developed for turn-based 
games in [9], adding to them a treatment of the LTL objectives (the latter is 
not too complicated, and our main goal in this section is to set the stage for the
concurrent setting, which was left open in [9]). In order to present our solution, 
we first need some definitions and notations. 

Consider a k-player turn-based game 𝒢 = ⟨G, {ψ_i}_{i∈[k]}⟩. Let G = ⟨AP,
{V_i}_{i∈[k]}, v_0, E, τ⟩. Recall that for a subset S ⊆ [k] of players, an S-profile is
a set of strategies, one for each player in S, and that a profile π extends an
S-profile π' if the players in S use in π their strategies in π'. The outcome of
an S-profile π', denoted Outcome(π'), is the union of plays that are outcomes
of profiles that extend π'. Thus, Outcome(π') ⊆ V^ω is the set of plays that are
possible outcomes of the game when the players in S follow their strategies in π'.

Consider a profile π = ⟨f_1, ..., f_k⟩ and a prefix h ∈ V* of Outcome(π). For a
profile π' = ⟨f'_1, ..., f'_k⟩, we define the profile switch(π, π', h) = ⟨f_1^h, ..., f_k^h⟩ as
the profile in which the players first follow π and generate h, and then switch
to following π'. Formally, for every x ∈ V* and Player i ∈ [k], if x ⪯ h, then
f_i^h(x) = f_i(x), and if x = h·y, then f_i^h(x) = f'_i(y). Note that since the last
vertices in x and y coincide, then switch(π, π', h) is well defined, in the sense that
it returns only allowed actions. Also note that f_i^h(h) = f'_i(ε), thus, switching to
following π', we reset the history of the game so far. The strategies in nodes that
are neither a prefix of h nor an extension of h are arbitrary and can follow π.

For i ∈ [k] \ {1} and a prefix h ∈ V*·V_i of some play in Outcome({f_1}),
we say that Player i wins from h if there exists a strategy f'_i for Player i such
that for every profile π = ⟨f_1, ..., f_k⟩ with h ⪯ Outcome(π), we have that
i ∈ Win(switch(π, π[i ← f'_i], h)). That is, Player i wins in every profile in which
the players first generate h, and then Player i switches to following f'_i, while the
other players adhere to their strategies in the original profile. The strategy f'_i is
then called an h-winning strategy for Player i.

Since turn-based games always have an NE, an NRS solution in 𝒢 is a strategy
f_1 for Player 1 such that for every 1-fixed NE π = ⟨f_1, ..., f_k⟩, we have that
1 ∈ Win(π). Equivalently, for every profile π = ⟨f_1, ..., f_k⟩, we have that either
1 ∈ Win(π), or there exists i ∈ Lose(π) such that i ∈ Win(π[i ← f'_i]) for some
strategy f'_i for Player i. As detailed in [9], this implies that a strategy f_1 for
Player 1 is an NRS solution iff for every path ρ in Outcome({f_1}), either τ(ρ) ∈
L(ψ_1), or there is i ∈ [k] \ {1} such that τ(ρ) ∉ L(ψ_i), and there are a prefix
h ⪯ ρ and an h-winning strategy f'_i for Player i. We then say that h is a good
deviation point for Player i, and f'_i is a good deviation for Player i.

Our goal is to define a tree automaton that accepts a strategy tree for Player 1
iff it is an NRS solution. The tree automaton should check that every path in
Outcome({f_1}) that does not satisfy ψ_1 has a good deviation point for one of the
players that lose in it. For that purpose, a strategy f_1 of Player 1 is going to be
certified by information about deviations: each path in Outcome({f_1}) that does
not satisfy ψ_1 is labeled by a player i that loses in the path, a good deviation
point for Player i, and a good deviation for Player i. Note that each deviation
may handle only a subset of the paths below the good deviation point, and thus 
a subtree in the certified strategy tree may be labeled by strategies of different 
players, each deviating at different points. 

Formally, a certified strategy tree is a ((V ∪ {⊙}) × [k])-labeled V-tree ⟨V*, g⟩,
where each node is labeled by a pair (v, i), where v ∈ V ∪ {⊙} is a strategy-label,
and i ∈ [k] is a player-label. We use g_s and g_p to refer to the projection of g on
the strategy and player components. Each path in the tree that corresponds to a
play in Outcome({f_1}) has a suffix all of whose nodes are labeled by the same player
label. If this label is 1, then the strategy labels describe a strategy of Player 1
and the path should satisfy ψ_1. If this label is i ∈ [k] \ {1}, then a deviation point
of Player i has been encountered, the strategy labels describe a good deviation
for Player i, and the path should not satisfy ψ_i. As long as a deviation point has
not been encountered, the strategy labels describe a strategy for Player 1 (and
so, they are in V in nodes with a direction in V_1, and are ⊙ in nodes with a
direction not in V_1). Once a deviation point for Player i is encountered (which
is indicated by the strategy label being changed from ⊙ to a vertex in V in a
node with a direction in V_i), the strategy labels describe a strategy for Player i.

By adjusting Lemma 7 in [9] to the setting with LTL objectives, we get the 
following. 


Theorem 3. A strategy f_1 for Player 1 is an NRS solution iff there is a certified
strategy tree ⟨V*, g⟩ that agrees with f_1. Thus, for every h ∈ V* and v ∈ V_1 such
that h·v ∈ Outcome({f_1}), we have that g_s(h·v) = f_1(h·v).

We now define a tree automaton that accepts certified strategy trees, which we
then use for solving NRS.


Theorem 4. We can construct a UCT U over ((V ∪ {⊙}) × [k])-labeled V-trees
such that U accepts a ((V ∪ {⊙}) × [k])-labeled V-tree ⟨V*, g⟩ iff ⟨V*, g⟩ is a
certified strategy tree. The size of U is polynomial in |G| and exponential in
|ψ_1|, |ψ_2|, ..., |ψ_k|.


Proof. The requirements on a certified strategy tree (V*,g) for Player 1 can be 
decomposed to the following conditions. 


— (C_1^i) For every i ∈ [k] \ {1}, the subtree of every node h ∈ V*·V_i in the tree
that is labeled by V × [k] is labeled by an h-winning strategy for Player i.

— (C_2) The (infinite) suffix of every path in the tree is p-labeled by a single
i ∈ [k].

— (C_3^i) For every i ∈ [k] \ {1}, every path in the tree with a suffix p-labeled by
i has a good deviation point for Player i.

— (C_4^1) Player 1 wins in every path in the tree with a suffix p-labeled by 1.

— (C_4^i) For every i ∈ [k] \ {1}, Player i loses in every path in the tree with a
suffix p-labeled by i.


In the full version, we describe UCTs that check these conditions and whose 
intersection is of the desired size. 


Theorem 5. Solving NRS can be done in time polynomial in |G| and
doubly-exponential in |ψ_1|, ..., |ψ_k|. The problem is 2EXPTIME-hard in each of
|ψ_1|, ..., |ψ_k|.


Proof. We start with the upper bound. By Theorems 3 and 4, we can reduce
NRS to nonemptiness of a UCT U over ((V ∪ {⊙}) × [k])-labeled V-trees of size
polynomial in |G| and exponential in |ψ_1|, |ψ_2|, ..., |ψ_k|. Using considerations
similar to those used in the proof of Theorem 2 (in particular, the fact that U is
deterministic in its V-element), we can construct from it an NBT N of size
polynomial in |G| and doubly-exponential in |ψ_1|, |ψ_2|, ..., |ψ_k| that preserves
the nonemptiness of U. Since the nonemptiness problem for NBTs can be solved
in quadratic time [26], the desired complexity follows.

We continue to the lower bounds, and we show that they are valid already in the
case $k = 2$. We again use reductions from deciding 2-player zero-sum games.
In order to prove 2EXPTIME-hardness in $|\psi_1|$, consider a 2-player zero-sum
game $\mathcal{G} = (G, \psi)$, for a fixed-size $G$. We claim that the 2-player game $\mathcal{G}' =
(G, \{\psi, \mathit{true}\})$ is such that $G$ is of a fixed size and that there is an NRS solution
in $\mathcal{G}'$ iff Player 1 wins $\mathcal{G}$. Indeed, since the objective of Player 2 is $\mathit{true}$, every
profile $\pi$ in $\mathcal{G}'$ is a 1-fixed NE. So, in order for a strategy $f_1$ to be an NRS
solution, it must satisfy that $1 \in \mathit{Win}((f_1, f_2))$, for every strategy $f_2$ for Player 2.
Equivalently, it is a winning strategy for Player 1 in $\mathcal{G}$.

In order to prove 2EXPTIME-hardness in $|\psi_2|$, consider again a 2-player
zero-sum game $\mathcal{G} = (G, \psi)$, for a fixed-size $G$. We construct a 2-player game
$\mathcal{G}' = (H, \{\psi_1, \psi_2\})$ such that $H$ and $\psi_1$ are of a fixed size, the size of $\psi_2$ is linear
in $|\psi|$, and there is an NRS solution in $\mathcal{G}'$ iff Player 1 wins $\mathcal{G}$. The game graph $H$
is as in the proof of Theorem 2. Thus, it has an initial vertex from which Player 2
chooses between two copies of $G$. The states of the first copy are labeled by a
fresh atomic proposition $p$. Then, $\psi_1 = XGp$, and $\psi_2 = X((\psi \wedge Gp) \vee ((\neg\psi) \wedge
G\neg p))$. Thus, the objective of Player 1 is for the play to be generated in the first
copy, and the objective of Player 2 is either to generate a play in the first copy
whose computation satisfies $\psi$, or to generate a play in the second copy whose
computation does not satisfy $\psi$.

If Player 1 has a winning strategy $f_1$ in $\mathcal{G}$, then there is an NRS solution $f'_1$ in
$\mathcal{G}'$, where $f'_1$ follows $f_1$ in the copy Player 2 chooses. Indeed, as $f'_1$ guarantees the
satisfaction of $\psi$ for all possible behaviors of Player 2, a profile $\pi$ is a 1-fixed NE
only if Player 2 chooses the first copy. If Player 1 loses $\mathcal{G}$, then for every strategy
$f_1$ for Player 1, there is a strategy $f_2$ for Player 2 such that $\mathit{Outcome}((f_1, f_2))$
does not satisfy $\psi$. So, for every strategy $f'_1$ for Player 1 in $\mathcal{G}'$, we have that there
is a 1-fixed NE $\pi = (f'_1, f'_2)$ such that $1 \in \mathit{Lose}(\pi)$, where $f'_2$ is the strategy that
chooses the second copy, and ensures that $\psi$ is not satisfied. Hence, there is no
NRS solution in $\mathcal{G}'$.


5.2 Concurrent games 


Consider a $k$-player concurrent game $\mathcal{G} = (G, \{\psi_i\}_{i \in [k]})$. Let $G = (AP, V, v_0,
\{A_i\}_{i \in [k]}, \{\kappa_i\}_{i \in [k]}, \delta, \tau)$. As our constructions in this section are loaded with
notations, we simplify the setting and assume that there is one set $A$ of actions,
available to all players in all vertices. That is, $A_1 = A_2 = \cdots = A_k = A$, and
for every $v \in V$ and $i \in [k]$, we have that $\kappa_i(v) = A$. All our constructions and
results can be easily extended to the general case.

As in the turn-based setting, we define a UCT that accepts certified strategy
trees for Player 1. In the concurrent setting, however, certification is much
more complicated. Below we explain the challenges in the concurrent setting
and how we overcome them. For $i \in [k] \setminus \{1\}$ and a prefix $h \in V^*$ of some
path in $\mathit{Outcome}(\{f_1\})$, we say that Player $i$ wins from $h$ if for every profile
$\pi = (f_1, \ldots, f_k)$ with $h \preceq \mathit{Outcome}(\pi)$, we have that there exists a strategy $f'_i$
for Player $i$ such that $i \in \mathit{Win}(\mathit{switch}(\pi, \pi[i \leftarrow f'_i], h))$. Thus, Player $i$ wins from
$h$ if she has a beneficial deviation to switch to from $h$, for every profile $\pi$ with
$h \preceq \mathit{Outcome}(\pi)$. Note that for different profiles, Player $i$ might have different
such beneficial deviations. Here, however, the prefix $h$ need not end in $V_i$ (in
fact, there is no $V_i$ in the concurrent setting). We say that $h$ is a winning point
for Player $i$. Also, we say that $(h, (f_1(h), \ldots, f_k(h)))$ is a good deviation pair for
Player $i$ iff there exists $a_i \in A$ such that $h \cdot \delta(h, (f_1(h), \ldots, a_i, \ldots, f_k(h)))$ is a
winning point for Player $i$.

In order to better understand the difference between NRS solutions in the
turn-based and concurrent settings, recall that a strategy $f_1$ for Player 1 is not an
NRS solution iff there is a 1-fixed NE $\pi = (f_1, \ldots, f_k)$ whose outcome $\rho$ does not
satisfy $\psi_1$. Note that $\pi$ being a 1-fixed NE means that for every prefix $h \cdot v \cdot u$ of $\rho$,
there exist actions $(a_2, \ldots, a_k) \in A^{k-1}$ such that $\delta(v, (f_1(h \cdot v), a_2, \ldots, a_k)) = u$
and for every $i \in \mathit{Lose}(\rho)$, we have that $(h \cdot v, (f_1(h \cdot v), a_2, \ldots, a_k))$ is not a good
deviation pair for Player $i$. In particular, we can choose $a_i = f_i(h \cdot v)$. Hence,
if $f_1$ is an NRS solution, and there exists a path $\rho \in \mathit{Outcome}(\{f_1\})$ that does
not satisfy $\psi_1$, then there must be a prefix $h \cdot v \cdot u \preceq \rho$ such that for every
$(a_2, \ldots, a_k) \in A^{k-1}$ with $\delta(v, (f_1(h \cdot v), a_2, \ldots, a_k)) = u$, there exists $i \in \mathit{Lose}(\rho)$
such that $(h \cdot v, (f_1(h \cdot v), a_2, \ldots, a_k))$ is a good deviation pair for Player $i$. We
then say that $h \cdot v \cdot u$ is a good deviation transition for $\mathit{Lose}(\rho)$. Thus, while in the
turn-based setting it is sufficient to find in every path in which Player 1 loses
a good deviation point for one of the players that lose in it, in the concurrent
setting the definition of good deviation depends on the transition induced by the
specific profile being used, and so we have to consider deviating transitions, and
there may be several players in $\mathit{Lose}(\rho)$ that deviate. Accordingly, in order to
certify a strategy for Player 1, we should describe a mapping from every vector
of actions to a set of players, along with their deviations.

Another difference between the turn-based setting and the concurrent setting
is that only in the former is the existence of some 1-fixed NE guaranteed [7]. Hence,
we have to add such a check to the algorithm (which is in fact easy).

We can now define certified strategy trees for the concurrent setting. Every 
node in a certified strategy tree is labeled by the following components: 


1. An action $a_1 \in A$, which is the strategy for Player 1.

2. A deviation function $d : A^{k-1} \to (A \cup \{\bot\})^{k-1}$, which maps a vector of
actions of players $2, \ldots, k$ to the set of players that deviate from it, along
with their deviations. Specifically, $(a_2, \ldots, a_k) \in A^{k-1}$ being mapped to
$(a'_2, \ldots, a'_k) \in (A \cup \{\bot\})^{k-1}$ indicates that for every $i \in [k] \setminus \{1\}$, if $a'_i \in A$,
then $a'_i$ is the deviation for Player $i$ from $(a_2, \ldots, a_k)$, and if $a'_i = \bot$ then no
deviation from $(a_2, \ldots, a_k)$ is specified for Player $i$. Let $D$ denote the set of
all possible deviation functions.

3. A set $L \subseteq \{2, \ldots, k\}$ of players, which describes the set of players that lose
in a given path and which are therefore expected to have a good deviation
transition. That is, if a suffix of a path is labeled by $\emptyset$, then Player 1 should
win in this path, and if a suffix of a path is labeled by $L \neq \emptyset$, then all the
players in $L$ lose in this path.

4. A vector of actions $(a_2, \ldots, a_k) \in A^{k-1}$, which describes the strategies for
the other players in the required 1-fixed NE.


Formally, a certified strategy tree is an $(A \times D \times 2^{\{2,\ldots,k\}} \times A^{k-1})$-labeled $V$-tree
$(V^*, g)$, where each node is labeled by a strategy-label $a_1 \in A$, a deviation-
label $d \in D$, a player-label $L \in 2^{\{2,\ldots,k\}}$, and an NE-label $(a_2, \ldots, a_k) \in A^{k-1}$. We
use $g_s$, $g_d$, $g_p$, and $g_{NE}$ to refer to the projection of $g$ on its different components.

For a node $h \cdot v$ that is s-labeled by $a_1$ and d-labeled by $d$, a possible successor
$u$ of $v$, and a set of losers $L$, we say that $h \cdot v \cdot u$ is marked as a good deviation
transition for $L$ iff the following hold:


1. For every $(a_2, \ldots, a_k) \in A^{k-1}$ such that $\delta(v, (a_1, a_2, \ldots, a_k)) = u$, there
exists $i \in L$ such that $(d((a_2, \ldots, a_k)))_i \in A$. That is, for every vector of
actions $(a_2, \ldots, a_k)$ that leads to $u$, there is $i \in L$ such that $d$ assigns a
deviation for Player $i$ from $(a_2, \ldots, a_k)$.

2. For every $i \in L$, there is a vector of actions $(a_2, \ldots, a_k) \in A^{k-1}$ such that
$\delta(v, (a_1, a_2, \ldots, a_k)) = u$, and $(d((a_2, \ldots, a_k)))_i \in A$. That is, for every $i \in L$
there exists a vector of actions that leads to $u$, from which $d$ assigns a
deviation for Player $i$.
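
The two marking conditions can be checked by brute force over the action vectors. The following is a minimal sketch, assuming a single action set shared by all players; the names `is_marked_good_deviation_transition`, `BOT`, and the toy transition function are hypothetical and only illustrate the definition.

```python
from itertools import product

BOT = None  # hypothetical stand-in for "no deviation specified"


def is_marked_good_deviation_transition(delta, actions, k, v, a1, d, u, losers):
    """Check conditions 1 and 2 for the transition v -> u and the set `losers`.

    delta(v, vec) returns the successor of v under the action vector vec,
    d maps each vector (a2, ..., ak) to a (k-1)-tuple over actions and BOT,
    and a1 is the action of Player 1 at v.
    """
    # All vectors (a2, ..., ak) that lead from v to u when Player 1 plays a1.
    leading_to_u = [rest for rest in product(actions, repeat=k - 1)
                    if delta(v, (a1,) + rest) == u]

    def deviators(rest):
        # Entry i - 2 of d[rest] belongs to Player i (players 2..k).
        return {i for i in losers if d[rest][i - 2] is not BOT}

    # Condition 1: every vector leading to u has a deviating player in losers.
    cond1 = all(deviators(rest) for rest in leading_to_u)
    # Condition 2: every loser deviates from some vector leading to u.
    cond2 = all(any(i in deviators(rest) for rest in leading_to_u)
                for i in losers)
    return cond1 and cond2


# Toy instance with k = 3: the successor is "u" iff Players 2 and 3 agree.
toy_delta = lambda v, vec: "u" if vec[1] == vec[2] else "w"
toy_d = {(0, 0): (1, BOT), (0, 1): (BOT, BOT),
         (1, 0): (BOT, BOT), (1, 1): (BOT, 0)}
```

On this toy instance the transition to "u" is marked for the loser set {2, 3} but not for {2}, because the vector (1, 1) offers no deviation for Player 2.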


Now, an $(A \times D \times 2^{\{2,\ldots,k\}} \times A^{k-1})$-labeled $V$-tree $(V^*, g)$ is a certified strategy
tree iff it satisfies the following conditions:


— (C1) If $h \cdot v \cdot u \in V^*$ is marked as a good deviation transition for a set of
players $L \subseteq \{2, \ldots, k\}$, $L \neq \emptyset$, then it is indeed a good deviation transition
for $L$.

— (C2) Every path $\rho$ has a set $L$ such that $\rho$ is eventually always p-labeled by
$L$.

— (C3) For every $L \subseteq \{2, \ldots, k\}$, $L \neq \emptyset$, every path in the tree with a suffix
p-labeled by $L$ has a good deviation transition for $L$.

— (C4) Player 1 wins in every path in the tree with suffix p-labeled by $\emptyset$.

— (C5) For every $i \in [k] \setminus \{1\}$, Player $i$ loses in every path in the tree with
suffix p-labeled by $L$ such that $i \in L$.

— (C6) The s- and NE-labeling of the tree specifies a 1-fixed NE.


Theorem 6. A strategy $f_1$ for Player 1 is an NRS solution iff there is a certified
strategy tree $(V^*, g)$ that agrees with $f_1$. That is, for every $h \in V^*$, we have that
$g_s(h) = f_1(h)$.


Theorem 7. We can construct a UCT over $(A \times D \times 2^{\{2,\ldots,k\}} \times A^{k-1})$-labeled $V$-
trees that accepts an $(A \times D \times 2^{\{2,\ldots,k\}} \times A^{k-1})$-labeled $V$-tree $(V^*, g)$ iff $(V^*, g)$ is
a certified strategy tree. The size of the UCT is polynomial in $|G|$ and exponential
in $|\psi_1|, |\psi_2|, \ldots, |\psi_k|$.


Proof. We can construct UCTs for C2, C4, and C5 that are very similar to
the UCTs for C2, C4, and C5 from the turn-based setting. The UCT for C6 is
similar to the UCT for CRS solutions. In the full version, we describe in detail
a UCT that checks the satisfaction of C1. Then, the conjunction of the above
UCTs results in a UCT that accepts certified strategy trees.


The number of deviation functions $d : A^{k-1} \to (A \cup \{\bot\})^{k-1}$ is exponential
in $|A|$. Consequently, the UCT described in Theorem 7 has an exponential
alphabet, which causes the NBT generated in [20] to have exponentially many
transitions, making its nonemptiness problem exponential. In order to overcome
this problem, we introduce vertical annotation of certified strategy trees, which
essentially replaces a node labeled by $d \in D$ by a sequence of nodes, labeled by a
smaller alphabet.
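
To get a feeling for how quickly the set $D$ grows, one can count the functions from $A^{k-1}$ to $(A \cup \{\bot\})^{k-1}$ directly. The sketch below is plain arithmetic; the function name `num_deviation_functions` is hypothetical.

```python
def num_deviation_functions(num_actions: int, k: int) -> int:
    """Number of functions d : A^(k-1) -> (A ∪ {⊥})^(k-1) for |A| = num_actions."""
    domain_size = num_actions ** (k - 1)          # vectors (a2, ..., ak)
    codomain_size = (num_actions + 1) ** (k - 1)  # each entry is an action or ⊥
    return codomain_size ** domain_size


# With two actions and three players there are already 9^4 functions.
small = num_deviation_functions(2, 3)
# One more action raises this to 16^9.
larger = num_deviation_functions(3, 3)
```

Even for these tiny parameters the alphabet component $D$ has thousands to billions of letters, which is what the vertical annotation is meant to avoid.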

Explaining our vertical annotation, we find it clearer to go back to a detailed
description of the actions of the different players, and thus refer to $A_i$ and $\kappa_i$ rather
than assuming they are all equal to $A$. For every vertex $v \in V$ and an action
$a_1 \in \kappa_1(v)$ for Player 1, we denote by $T_{v,a_1}$ the set of possible vectors of actions
from $v$, given that Player 1 chose $a_1$. That is, $T_{v,a_1} = \{a_1\} \times \kappa_2(v) \times \cdots \times \kappa_k(v)$. We
order the vectors in $T_{v,a_1}$ arbitrarily, and, for $1 \le i \le |T_{v,a_1}|$, denote by $t^i_{v,a_1}$ the
$i$-th item in $T_{v,a_1}$. We also denote by $t^i_{v,a_1}[j \leftarrow a'_j]$ the vector of actions obtained
from $t^i_{v,a_1}$ by replacing the action for Player $j$ by $a'_j$.

Fig. 2. A vertically certified strategy tree. Information about deviations from a
given vector of actions is stored in an intermediate node that corresponds to that
vector. There is no good deviation transition starting from $v$.

Fig. 3. $h \cdot v \cdot v'$ is a good deviation transition for $L$, where the root $v$ is
reachable via history $h$.

For a given transition from $v$ to $v'$, let $T_{v,v',a_1}$ be the restriction of $T_{v,a_1}$
to vectors $(a_1, a_2, \ldots, a_k)$ such that $\delta(v, (a_1, a_2, \ldots, a_k)) = v'$, and let $t^i_{v,v',a_1}$
denote the $i$-th item in $T_{v,v',a_1}$. Also, let $T = \bigcup_{v \in V} \bigcup_{a_1 \in \kappa_1(v)} T_{v,a_1}$ and $\Sigma =
A_1 \cup ((A_2 \times \cdots \times A_k) \cup \bigcup_{i \in [k]} (\{i\} \times A_i)) \times (\emptyset \cup (V \times 2^{\{2,\ldots,k\}}))$. We also need an
additional $2^{\{2,\ldots,k\}}$ component, for annotating the set of losers in each path in
the tree, but we omit it for now, as it is not relevant for the vertical annotations.
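
The sets of action vectors used by the vertical annotation are ordinary Cartesian products and can be enumerated directly. Below is a minimal sketch, assuming that the available actions are given as a dictionary `kappa` from players 2..k to functions on vertices; the helper names `action_vectors`, `restrict`, and `replace_action` are hypothetical.

```python
from itertools import product


def action_vectors(kappa, v, a1):
    """The vectors (a1, a2, ..., ak) available at v once Player 1 chose a1."""
    players = sorted(kappa)  # players 2..k, in increasing order
    return [(a1,) + rest
            for rest in product(*(kappa[i](v) for i in players))]


def restrict(vectors, delta, v, v_next):
    """Keep only the vectors whose transition from v leads to v_next."""
    return [t for t in vectors if delta(v, t) == v_next]


def replace_action(t, j, a):
    """The vector t with the action of Player j replaced by a (players start at 1)."""
    return t[:j - 1] + (a,) + t[j:]


# Toy instance with k = 3: both other players may play "x" or "y" everywhere.
toy_kappa = {2: lambda v: ["x", "y"], 3: lambda v: ["x", "y"]}
toy_delta = lambda v, t: "v1" if t[1] == t[2] else "v2"

T = action_vectors(toy_kappa, "v0", "go")
T_to_v1 = restrict(T, toy_delta, "v0", "v1")
```

The list order returned by `product` plays the role of the arbitrary ordering of the vectors; any other fixed order would do equally well.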

A certified strategy tree is now a $\Sigma$-labeled $(V \cup T)$-tree, where nodes with
direction in $V$ are labeled by the strategy for Player 1, and nodes with direction
in $T$ are labeled with deviation information. Consider a node that corresponds
to a vertex $v$ with history $h$ and is labeled by $a_1$. Following vertically there are
nodes corresponding to $T_{v,a_1}$, each labeled by a vector of actions (see Fig. 2).
This information is for verifying good deviation transitions. That is, after a
good deviation transition is announced, we need to verify that the involved
players indeed have appropriate beneficial deviations. The last node in the chain,
which corresponds to $t^{|T_{v,a_1}|}_{v,a_1}$, is either not labeled, or labeled by a vertex $v'$ and
a set $L$ of losers. If it is not labeled, it means there are no good deviation
transitions from $v$, in which case we continue to the nodes corresponding to
the possible successors of $v$. If it is labeled by $(v', L)$, it means that $h \cdot v \cdot v'$
is a good deviation transition for $L$ (see Fig. 3). Hence, the following nodes
correspond to $T_{v,v',a_1}$, each labeled by a single deviation, followed by a chain
of good deviation transitions, using the appropriate $V \times 2^{\{2,\ldots,k\}}$ annotations,
or by nodes corresponding to successor vertices. In the full version, we describe
formally a UCT for verifying good deviation transitions in vertically annotated
certified strategy trees.


Theorem 8. Solving NRS in the concurrent setting can be done in time polynomial
in $|G|$ and doubly-exponential in $|\psi_1|, |\psi_2|, \ldots, |\psi_k|$. The problem is
2EXPTIME-hard in each of $|\psi_1|, |\psi_2|, \ldots, |\psi_k|$.


Proof. We start with the upper bound. We can easily modify the UCTs for C2,
C4, C5, and C6 from the proof of Theorem 7 to accommodate the vertical annotation.
In conjunction with the UCT for verifying good deviation transitions,
described in the full version, we have a UCT $\mathcal{U}$ over $\Sigma$-labeled $(V \cup T)$-trees such
that $\mathcal{U}$ accepts a $\Sigma$-labeled $(V \cup T)$-tree $(V^*, g)$ iff $(V^*, g)$ is a vertically annotated
certified strategy tree. By Theorem 6, there is an NRS solution in $\mathcal{G}$ iff $\mathcal{U}$ is not
empty. The size of $\mathcal{U}$ is polynomial in $|G|$ and exponential in $|\psi_1|, |\psi_2|, \ldots, |\psi_k|$,
and its alphabet is polynomial in $|G|$. Also, $\mathcal{U}$ is deterministic in its $V$-element.
Hence, as detailed in the proof of Theorem 2, we can construct from $\mathcal{U}$ an NBT
$\mathcal{N}$ of size polynomial in $|G|$ and doubly-exponential in $|\psi_1|, |\psi_2|, \ldots, |\psi_k|$ such
that $L(\mathcal{U}) \neq \emptyset$ iff $L(\mathcal{N}) \neq \emptyset$. Since the nonemptiness problem for NBT can be
solved in quadratic time [26], the desired complexity follows.

Finally, as turn-based games are a special case of concurrent ones, the lower 
bound from Theorem 5 applies here. 


5.3 General rational synthesis 


Consider a $k$-player game $\mathcal{G} = (G, \{\psi_i\}_{i \in [k]})$. Recall that the problem of rational
synthesis is to return a $(\{1\} \cup CU)$-profile $\pi'$ such that there is a 1-fixed NE that
extends $\pi'$, and for every 1-fixed NE $\pi$ that extends $\pi'$, we have that $1 \in \mathit{Win}(\pi)$.

A $(\{1\} \cup CU)$-profile $\pi'$ is an RS-solution iff for every path $\rho$ such that $\rho \not\models \psi_1$,
there is no 1-fixed NE $\pi$ that extends $\pi'$ and $\mathit{Outcome}(\pi) = \rho$. It is easy to
see that we can define certified $(\{1\} \cup CU)$-profile trees in a way similar to how we
defined certified strategy trees for Player 1, inducing an algorithm of the same
complexity for checking the existence of a certified $(\{1\} \cup CU)$-profile tree. Also,
as NRS is a special case of RS, the NRS lower bound provides a lower bound for
RS. Hence, we can conclude with the following.


Theorem 9. Solving RS can be done in time polynomial in $|G|$ and doubly-
exponential in $|\psi_1|, |\psi_2|, \ldots, |\psi_k|$. The problem is 2EXPTIME-hard in each of
$|\psi_1|, |\psi_2|, \ldots, |\psi_k|$.
Abstract. In multi-agent settings, such as IoT and robotics, it is
necessary to coordinate the actions of independent agents to achieve a joint
behavior. While it is often easy to specify the desired behavior, pro- 
gramming the necessary coordination can be difficult. This makes coor- 
dination an attractive target for automated program synthesis; however, 
current methods may produce strategies that issue useless actions. This 
paper develops theory and methods to synthesize coordination strate- 
gies that are guaranteed not to initiate unnecessary actions. We refer to 
such strategies as being “compact.” We formalize the intuitive notion 
of compactness, show that existing methods do not guarantee compact- 
ness, and propose a solution. The solution transforms a given temporal 
logic specification using automata-theoretic constructions to incorporate 
a notion of minimality. The central result is that the winning strategies 
for the transformed specification are precisely the compact strategies for 
the original. One can therefore apply known synthesis methods to pro- 
duce compact strategies. We report on prototype implementations that 
synthesize compact strategies for temporal logic specifications and for 
specifications of multi-robot coordination. 


1 Introduction 


Imagine a future home where devices are network-controllable and the control 
program is synthesized from requirements. Suppose that the homeowner asks 
for the living-room lights to be turned on when it gets dark. To meet this re- 
quirement, a control program must necessarily coordinate the on/off state of the 
lights with readings from an illumination sensor. 

This specification may be expressed more precisely in linear-time temporal
logic (LTL) as $G(\mathit{dark} \Rightarrow X\,\mathit{light\text{-}on})$.$^3$ Here “dark” is a proposition that
represents a reading from the sensor, and is therefore an input to the control
program, while “light-on” is a proposition that represents an action, and is
therefore an output of the control program. Abstracting this formula to the shape
$G(a \Rightarrow Xb)$, the left half of Figure 1 shows the smallest state machine that
meets this specification. It represents a control program that entirely ignores the
sensor input and leaves the lights on all day! This strategy is clearly undesirable,
although technically it does meet the specification. The machine on the
right represents the “commonsense” controller that keeps the lights on only as
long as the sensor indicates that it is dark. The two machines are equally valid
from the viewpoint of correctness. How then should we distinguish them? And
how can a synthesis method avoid generating undesirable solutions? Those are
the questions addressed in this paper.

3 G and X are, respectively, the temporal always and next-time operators. Actions are
assumed instantaneous for simplicity.


Fig. 1. Non-compact (left, panel (i)) and compact (right, panel (ii)) machines for
$G(a \Rightarrow Xb)$. The initial state is indicated by a thick border. Output actions are
listed at each state; input conditions are placed on the edges.


We suggest that the crucial distinguishing factor is that the left-hand machine 
invokes actions that are not essential to satisfying the property. For instance, if 
the input a is false now, there is no need to invoke action b in the next step. If 
input a remains false, there is no need to invoke action b at all. It is vital to avoid 
useless actions in the domains of IoT and robotics, where agents interact with 
the physical world: there is no need to switch on a toaster when only watering 
the lawn is asked for. Indeed, switching on the toaster unexpectedly may have 
dangerous side effects. A reader may easily imagine other similar situations. 
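
The difference between the two controllers can be observed by simulating them on a finite input prefix. The sketch below uses hypothetical encodings of the two machines (the state names "s", "on", and "off" are assumptions, not taken from the figure) and checks the specification only on the simulated prefix.

```python
def run(machine, inputs):
    """Outputs of a Moore machine on a finite input prefix.

    A machine is a triple (initial state, output function, step function);
    the output is True when action b is issued. One output is produced per
    visited state, so the result has len(inputs) + 1 entries.
    """
    state, output, step = machine
    outputs = [output(state)]
    for a in inputs:
        state = step(state, a)
        outputs.append(output(state))
    return outputs


def satisfies_prefix(inputs, outputs):
    """Whenever a holds at step t, b must be issued at step t + 1."""
    return all(outputs[t + 1] for t, a in enumerate(inputs) if a)


# Machine that ignores its input and always issues b.
ALWAYS_ON = ("s", lambda s: True, lambda s, a: "s")
# Machine that issues b exactly after a step in which a was true.
FOLLOW_INPUT = ("off", lambda s: s == "on", lambda s, a: "on" if a else "off")

inputs = [False, True, False, False]
wasteful = run(ALWAYS_ON, inputs)
frugal = run(FOLLOW_INPUT, inputs)
```

Both runs satisfy the property on this prefix, but the first issues b five times while the second issues it once.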

We refer to the policy of avoiding unnecessary actions as compactness. Strate- 
gies that satisfy this property while meeting the specification are called compact. 
An immediate question is whether compactness is ensured by standard synthe- 
sis methods. Unfortunately, the answer is ‘no.’ Bounded synthesis [35,20], for 
instance, will produce the smallest satisfying Mealy or Moore machine; in this 
setting, the solution of Figure 1(i). We have validated this experimentally with 
the tool BoSy [19]. Quantitative synthesis (cf. [6]) finds solutions that are worst- 
case optimal, i.e., programs where the maximum cost, over all input sequences, 
is the lowest possible. (Dually, programs where the worst-case reward is the 
highest possible.) Letting each action invocation have unit cost, a quantitative 
method cannot distinguish between the solutions shown, as both have the same 
maximum cost for the input where a is always true. We make this analysis pre- 
cise subsequently, and show that average-case optimality also does not always 
distinguish compact from non-compact solutions. We have validated this exper- 
imentally with the tool QUASY [13]. Hence, compactness cannot be defined in 
quantitative terms: the synthesis of compact strategies requires new methods. 


48 K. S. Namjoshi and N. Patel 


At its core, the issue of compactness is a variation of the well-known frame 
problem in logic-based AI [29]. The natural way to express the example requirement
is as $G(a \Rightarrow Xb)$. However, the semantics of temporal logic allows many
satisfying interpretations; among those is the undesirable one of Figure 1(i). This
tension between the freedom of interpretation allowed in logic and the naturalness
of a specification is at the heart of the frame problem. One approach to
achieving compactness is therefore to write a tighter specification, which permits
fewer interpretations; e.g., to write the stronger assertion $G(a \Leftrightarrow Xb)$.
But this is not a natural choice. Moreover, reworking a specification by hand 
to rule out interpretations with unnecessary actions is difficult as the process 
is not compositional: i.e., one cannot rework portions of a specification sepa- 
rately. The specification transformation defined here performs such a tightening 
automatically, using automata-theoretic constructions. 


The motivating application of compactness is to the synthesis of centralized
coordination programs. As formalized in [3], in a coordination problem, a group
of independent agents, denoted $A_1, \ldots, A_n$, are guided by an additional synthesized
agent, $C$, so that their joint behavior meets a temporal specification $\varphi$.
That work describes a specification transformation from $\varphi$ to $\varphi'$ that incorporates
asynchronous agent behavior and other constraints. This transformation,
however, does not guarantee compactness. We take the transformed problem as 
the starting point for our investigations, and consider the more general question 
of how to generate a compact solution for a given temporal specification. 


We begin by proposing a mathematical definition of the compactness prop- 
erty. Generalizing from the example, one can consider a strategy to be compact if 
for each input sequence, the sequence of actions produced as output (1) meets the 
specification and (2) cannot be further improved. We formalize the second no- 
tion as minimality with respect to a supplied “better than” preference relation 
between two output sequences. This formulation is closely related to formal- 
izations of commonsense reasoning, in particular the notion of circumscription 
introduced by McCarthy in [28]. 


For coordination problems, a natural preference relation is based on the sub- 
set ordering on sets of actions. We say that sequence y is better than sequence 
x if (1) in each step, the actions issued in y are a subset of the actions issued by 
x and (2) for at least one step, the actions in y are a strict subset of the actions
in x. The smallest compact strategy for $G(a \Rightarrow Xb)$ under this preference relation
issues action b precisely when input a is true at the prior step. Otherwise,
there is a point where a is false but b is issued at the next step. Removing this 
occurrence of b produces a better sequence that also satisfies the property. This 
is precisely the strategy defined by the machine in Fig. 1(ii). Alternative prefer- 
ence relations may order sets of actions by size, or order sequences of actions by 
the substring relation. One may also compare infinite action sequences by cost 
(limit average or discounted sum) using comparator automata [2]. The choice of 
preference relation is driven by the application domain. To accommodate various 
options, compactness is parameterized by the preference relation. 
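
The subset-based preference relation is easy to state operationally. The following sketch compares two finite prefixes of output sequences, each step being a set of issued actions; restricting to finite prefixes of equal length is an assumption made for illustration, and the name `better` is hypothetical.

```python
def better(y, x):
    """True iff y is better than x: stepwise subset, strict in at least one step."""
    if len(y) != len(x):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(y, x))
    return (all(ys <= xs for ys, xs in pairs)      # subset at every step
            and any(ys < xs for ys, xs in pairs))  # strict subset somewhere


always_b = [frozenset({"b"}), frozenset({"b"}), frozenset({"b"})]
only_when_needed = [frozenset(), frozenset({"b"}), frozenset()]
```

Here `only_when_needed` is better than `always_b`, while no sequence is better than itself, which matches the irreflexivity expected of a preference relation.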
Technically, a temporal specification $\varphi$ can be viewed as a language $L$, a set
of infinite words over a joint input-output alphabet. For a preference relation
$\prec$ over infinite words, it is natural to formulate the language $\min(L, \prec)$ that
contains only the minimal words in $L$ with respect to the preference relation.
The central theoretical result in this paper is that there is a compact strategy
satisfying $L$ if, and only if, there is a strategy satisfying $\min(L, \prec)$. This theorem
reduces the question of synthesizing compact strategies to a standard synthesis
question, making it possible to use existing synthesis algorithms to produce
compact strategies. We give sufficient conditions under which $\min(L, \prec)$ is regular
when $L$ is a regular language, and show how to effectively construct a finite
automaton for the minimal language and for its complement, from either an
automaton or an LTL formula for $L$. The constructed automata can also be used to
model-check whether a given control program defines a compact strategy. Moreover,
the transformation makes it possible to modularly apply quantitative or
other criteria for synthesis from $\min(L, \prec)$; for instance, to synthesize compact
strategies that minimize program size or worst-case execution time.
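
The idea of keeping only minimal words can be illustrated on a finite toy language. The sketch below enumerates finite i/o prefixes for the running example (b must follow every a) and keeps those whose output cannot be improved; comparing only words that share the same input sequence, and working with finite prefixes, are assumptions made for this illustration rather than the paper's construction.

```python
from itertools import product


def better(y, x):
    """Stepwise at most as many actions, and different somewhere (0/1 outputs)."""
    return all(b1 <= b2 for b1, b2 in zip(y, x)) and y != x


def satisfies(inp, out):
    """Whenever the input is 1 at step t, the output is 1 at step t + 1."""
    return all(out[t + 1] == 1 for t, a in enumerate(inp) if a == 1)


def language(n):
    """All satisfying (input, output) prefixes with n inputs and n + 1 outputs."""
    return {(inp, out)
            for inp in product((0, 1), repeat=n)
            for out in product((0, 1), repeat=n + 1)
            if satisfies(inp, out)}


def minimal_words(lang):
    """Words whose output has no better satisfying output for the same input."""
    return {(inp, out) for (inp, out) in lang
            if not any(inp2 == inp and better(out2, out)
                       for (inp2, out2) in lang)}


minimal = minimal_words(language(2))
```

For every input prefix exactly one output survives: the one that issues nothing initially and then copies the previous input.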

We have implemented these constructions and used them to synthesize com- 
pact strategies for LTL specifications and for a class of specifications that arise 
in multi-robot coordination. Experiments show that compact strategies exist for 
many specifications and can be effectively computed, albeit with some added 
overhead. We also experiment with approximation methods which are simpler 
and avoid potential worst-case exponential blowups in the general construction. 

In our view, the main contributions of this work are in bringing attention 
to the need for compactness in program synthesis; showing its independence 
from existing criteria; giving a precise formulation in terms of minimality; and 
in designing and implementing algorithms to synthesize compact strategies. 


2 Background 


Automata. A finite automaton is a tuple $(Q, \Sigma, \hat{Q}, \delta, F)$ where $Q$ is a set of
states; $\Sigma$ is a set of letters, an alphabet; $\hat{Q}$ is a non-empty set of initial states;
$\delta \subseteq Q \times \Sigma \times Q$ is a transition relation; and $F$ is a non-empty set of final states.

A word over $\Sigma$ is a (possibly empty) sequence of letters from $\Sigma$. For a word
$w$, its length $|w|$ is the number of letters in $w$ if $w$ is finite and $\omega$ if $w$ is infinite.
We assume the standard definition of a run of the automaton on a word. If $w$ is
finite, a run on $w$ is accepting if the last state of the run is in $F$; if $w$ is infinite, a
run is accepting by the Büchi condition if a state in $F$ occurs on the run infinitely
often. The language of an automaton is the set of words for which there exists
an accepting run. One typically distinguishes between the finite-word language
and the infinite-word language of an automaton. An automaton is deterministic
if there is exactly one initial state and for every $q$ and $a$, there is at most one $q'$
such that $(q, a, q')$ is in $\delta$.
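
For concreteness, the finite-word part of this definition can be written down directly. The sketch below is one possible encoding (the class name `Automaton` and its field names are assumptions); it covers acceptance of finite words and the determinism test, not the Büchi condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    states: frozenset
    alphabet: frozenset
    initial: frozenset
    delta: frozenset  # set of triples (q, a, q')
    final: frozenset

    def accepts_finite(self, word):
        """True iff some run on the finite word ends in a final state."""
        current = set(self.initial)
        for a in word:
            current = {q2 for (q1, x, q2) in self.delta
                       if q1 in current and x == a}
        return bool(current & self.final)

    def is_deterministic(self):
        """One initial state and at most one successor per (state, letter)."""
        if len(self.initial) != 1:
            return False
        seen = set()
        for (q, a, _) in self.delta:
            if (q, a) in seen:
                return False
            seen.add((q, a))
        return True


# NFA over {a, b} accepting the finite words that end with the letter b.
ends_with_b = Automaton(
    states=frozenset({0, 1}),
    alphabet=frozenset("ab"),
    initial=frozenset({0}),
    delta=frozenset({(0, "a", 0), (0, "b", 0), (0, "b", 1)}),
    final=frozenset({1}),
)
```

The example automaton is nondeterministic because state 0 has two successors on the letter b.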

We use the standard abbreviations DFA, NFA and NBA for a deterministic 
automaton, a nondeterministic automaton over finite words, and a nondetermin- 
istic Büchi automaton over infinite words respectively. 
LTL. Linear Temporal Logic (LTL) is a logic defined over a set of atomic propositions,
$AP$. The logic has the following minimal grammar, where $p \in AP$:


$f ::= p \mid f_1 \wedge f_2 \mid \neg f_1 \mid X f_1 \mid f_1 \, U \, f_2$


The satisfaction relation is defined over infinite words where each letter is a
subset of $AP$. It has the form $w, i \models f$ for a word $w$ and a natural number $i$, and is
given by structural induction on formulas. We omit the standard definition. The
language of a formula is the set of words that satisfy it. Standard constructions
compile an LTL formula to an NBA that accepts the same language (cf. [18,39]),
possibly incurring an exponential blowup.
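
Although the semantics is defined over arbitrary infinite words, it can be evaluated mechanically on ultimately periodic ("lasso") words, given as a finite prefix followed by a repeated loop. The sketch below is an illustration under that assumption; the tuple encoding of formulas, the extra constant `("true",)`, and the helper names are hypothetical conveniences.

```python
def sat_positions(f, letters, loop_start):
    """Positions of the lasso word at which formula f holds."""
    n = len(letters)
    everything = set(range(n))

    def succ(i):
        # The last position loops back to the start of the loop.
        return i + 1 if i + 1 < n else loop_start

    op = f[0]
    if op == "true":
        return everything
    if op == "ap":
        return {i for i in everything if f[1] in letters[i]}
    if op == "not":
        return everything - sat_positions(f[1], letters, loop_start)
    if op == "and":
        return (sat_positions(f[1], letters, loop_start)
                & sat_positions(f[2], letters, loop_start))
    if op == "X":
        inner = sat_positions(f[1], letters, loop_start)
        return {i for i in everything if succ(i) in inner}
    if op == "U":
        left = sat_positions(f[1], letters, loop_start)
        result = set(sat_positions(f[2], letters, loop_start))
        changed = True
        while changed:  # least fixpoint of the until expansion law
            changed = False
            for i in everything:
                if i not in result and i in left and succ(i) in result:
                    result.add(i)
                    changed = True
        return result
    raise ValueError(f"unknown operator: {op}")


def holds(f, prefix, loop):
    """True iff the word prefix . loop^omega satisfies f at position 0."""
    if not loop:
        raise ValueError("the loop must be non-empty")
    letters = list(prefix) + list(loop)
    return 0 in sat_positions(f, letters, len(prefix))


def implies(f, g):
    return ("not", ("and", f, ("not", g)))


def always(f):
    return ("not", ("U", ("true",), ("not", f)))


spec = always(implies(("ap", "a"), ("X", ("ap", "b"))))
```

With this encoding, `spec` corresponds to the running example in which every a must be followed by b at the next step.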


Programs as Transition Systems. A program is represented by its state transition
system. This is a Moore machine, defined as a tuple $(S, \hat{S}, I, O, R, o)$ where $S$ is
a set of states; $\hat{S}$ is a non-empty set of initial states; $I$ is a set of input values;
$O$ is a set of output values; $R \subseteq S \times I \times S$ is the transition relation, which must
be total on $I$; and $o : S \to O$ is the output mapping. An execution of this system
is an unbounded alternating sequence of states and inputs, and takes the form
$s_0, a_0, s_1, a_1, \ldots$, such that for each $i$, the triple $(s_i, a_i, s_{i+1})$ is in the transition
relation. A computation is an execution of this form where $s_0$ is an initial state.


Input-Output Words. An input-output word (i/o word for short) is a pair of
sequences $(a, b)$ where $a$ is a sequence of inputs, $b$ is a sequence of outputs, and $|b| =
1 + |a|$. The input-output word induced by a program execution $s_0, a_0, s_1, a_1, \ldots$
is the pair $(a, b)$ with $b = o(s_0), o(s_1), \ldots$. We sometimes write an i/o word in
the linear format $b_0, a_0, b_1, a_1, \ldots$ for clarity. It is also common (cf. [32]) to view
an infinite i/o word $(a, b)$ in the “zipped” form $(a_0, b_0), (a_1, b_1), \ldots$. For
a temporal property $\varphi$ defined over input and output predicates and program
$M$, the program $M$ satisfies $\varphi$, written $M \models \varphi$, if the zipped input-output word
of every computation of $M$ satisfies $\varphi$. Each atomic proposition is a function in
$I \times O \to \mathit{Bool}$; an i/o pair $(a, b)$ induces the set of propositions $\{p \mid p(a, b)\}$.


Games and Strategies. A strategy is a function from finite sequences of inputs
to outputs, represented as $\sigma : I^* \to O$. For an infinite input sequence $a =
a_0, a_1, \ldots$ the strategy $\sigma$ induces the infinite output sequence denoted $\sigma(a)$, given
by $\sigma(\epsilon), \sigma(a_0), \sigma(a_0, a_1), \ldots$. A play for input $a$ is the i/o word $(a, b = \sigma(a))$. We
sometimes abuse this notation and use $\sigma(a)$ to refer to the play induced by $a$.
A play is winning for a temporal property $\varphi$ if it satisfies this property when
viewed as a zipped i/o word. A strategy $\sigma$ is winning for $\varphi$ if for every input $a$,
the play on $a$ is winning for $\varphi$.
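
A strategy in this sense is simply a function on input histories, and the induced output sequence applies it to every finite prefix of the input. The following sketch works on a finite input prefix; the names `induced_outputs` and `compact_sigma` are hypothetical, and inputs are encoded as booleans stating whether a holds.

```python
def induced_outputs(sigma, inputs):
    """The outputs sigma(empty), sigma(a0), sigma(a0, a1), ... for a finite prefix."""
    return [sigma(tuple(inputs[:n])) for n in range(len(inputs) + 1)]


def compact_sigma(history):
    """Issue b exactly when the most recent input had a set to true."""
    return {"b"} if history and history[-1] else set()


outputs = induced_outputs(compact_sigma, [True, False, True])
```

Note that the output sequence is one element longer than the input prefix, matching the length convention for i/o words.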

The realizability question is: given a property $\varphi$, determine whether there
exists a program satisfying $\varphi$. The synthesis question is: given a property $\varphi$ that
is realizable, construct a program that satisfies $\varphi$. A strategy $\sigma$ induces a
deterministic program with an infinite state space, denoted $P(\sigma) = (S, \hat{S}, I, O, R, o)$.
The state space $S$ is the set of finite input sequences $I^*$, the initial state is $\epsilon$,
the output label for state $x$ is $\sigma(x)$ and the transition relation $R$ is given by
$\{(x, a, xa) \mid x \in I^*, a \in I\}$. This is in fact an infinite complete tree over $I$
(sometimes called a “fulltree”) where each node is labeled by an output value. A
labeled fulltree, in turn, corresponds to a strategy and a deterministic program.
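
The induced program has infinitely many states, but its labeled fulltree can be unfolded to any finite depth. The sketch below is a bounded illustration only; the function names and the example strategy are hypothetical.

```python
from itertools import product


def labeled_fulltree(sigma, input_values, depth):
    """Nodes (input histories) up to the given depth, each mapped to its output."""
    return {node: sigma(node)
            for d in range(depth + 1)
            for node in product(input_values, repeat=d)}


def transitions(tree, input_values):
    """The triples (x, a, xa) between nodes that are present in the unfolding."""
    return {(x, a, x + (a,))
            for x in tree for a in input_values if x + (a,) in tree}


def sigma(history):
    # Example strategy: issue b exactly after an input in which a was true.
    return "b" if history and history[-1] else "-"


tree = labeled_fulltree(sigma, (False, True), depth=2)
edges = transitions(tree, (False, True))
```

With two input values, the unfolding to depth 2 has 1 + 2 + 4 nodes, and the root (the empty history) carries the output of the strategy on the empty sequence.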


Synthesis Methods for Temporal Properties. There is an extensive literature on
methods to synthesize programs from LTL specifications (cf. [34,32,31,26]) and
tools that implement various algorithms (cf. [33,8,17,27,35,20]), all based on the
conversion from LTL formulas to equivalent automata.

The classical approach to realizability of temporal properties (which we only
sketch here, cf. [34,32]) is via the connection between programs, strategies, and
labeled fulltrees. If a property $\varphi$ is realizable, there is a deterministic program $M$
satisfying $\varphi$. This program may also be seen as a strategy and a fulltree. From
a deterministic word automaton with the same language as $\varphi$, one constructs
a tree automaton that accepts precisely the fulltrees that satisfy $\varphi$. Now $\varphi$ is
realizable if and only if the language of this tree automaton is non-empty. For
properties in LTL, this procedure can be carried out in 2EXPTIME in the length
of the formula $\varphi$; the problem is 2EXPTIME-complete [32]. A winning strategy
can be extracted as a finite state, deterministic reactive program from the tree 
automaton, thanks to the finite-model property of temporal logic. This approach 
is implemented in the tool Strix [30]. 

Two other approaches have been developed. One is to limit the logic: the
GR(1) fragment expresses many useful properties, has a lower complexity
(DEXPTIME), and can be implemented easily using symbolic (BDD-based)
methods [31]. This is implemented in several tools [33,8,17,27]. The other is the
bounded synthesis method, which applies to full LTL and is iterative in nature.
By placing bounds on the size of the intended program and the ranking argument
for formula satisfaction, one obtains a simpler safety game, which can be solved
using symbolic methods [26,35,20]. The approach is implemented in [11,19].

We use two of the approaches described above in this work. The classical
approach is used to determine compact realizability of an arbitrary LTL formula,
while the GR(1) approach is used in the multi-robot setting.


3 Compactness 


We formulate compactness for temporal specifications, investigate its properties, 
and show how to synthesize a compact strategy through a specification transfor- 
mation. We consider specifications on infinite words for simplicity and to match 
the semantics of temporal logic. 

A relation <_O over the set of infinite output words is a preference relation if
its transitive closure <_O^+ is irreflexive. We informally say that word b is better
than b' if b <_O^+ b' holds. As the transitive closure is irreflexive, it is not possible
for a word to be better than itself, matching intuition. This relation is extended
to input-output words as follows. An i/o word (a,b) is better than an i/o word
(a',b') if (1) the input sequences a and a' are identical, and (2) b <_O^+ b'. The first
condition ensures that comparable words have the same input sequence, which
is important as we are ultimately interested in the i/o words that are generated
as plays of strategies.


Definition 1 (Compact Strategy). A strategy σ is compact for an i/o language
L if (1) σ is a winning strategy for L and (2) for every input sequence a,
there is no i/o word (a,b') such that (a,b') satisfies L and (a,b') is better than
the i/o word (a, b = σ(a)) that is produced as the play of σ on input a.


The first condition ensures that σ is a valid strategy for L; the second that
a compact strategy produces the “best possible” output for each input. We say 
that a language L is compactly realizable if it has a compact strategy. 


Theorem 1. A language L is realizable if it is compactly realizable. The con- 
verse does not hold. 


Proof. From right to left, consider a compact strategy σ for L. From the
definition, σ is a winning strategy for L, hence L is realizable.

The converse does not hold. Let the input set I = {0,1} and the output set
O = {c,d} with the output preference ordering c < d extended point-wise to
output words. Let the specification L consist of sequences of the form c(0c)^ω
and d({0,1}d)^ω. This is realizable. No winning strategy can produce c on ε as
there can be no win on input 1^ω. The single winning strategy produces d on
every input sequence, including ε. But this strategy is not compact: for input 0^ω
it generates d(0d)^ω, but there is the better word c(0c)^ω in L.


Standard realizability is monotone: if L' ⊆ L and program M satisfies L',
then M also satisfies L. However, compact realizability is neither monotone nor 
anti-monotone (proof in the full version). As is the case with deduction systems 
for commonsense reasoning (cf. [37]), non-monotonicity is a consequence of the 
formulation in terms of minimality. 

The simple example from the Introduction is easily extended to a collection
of N “if-condition-then-action” requirements. The IFTTT service (https://ifttt.com)
or Apple Shortcuts implement these operationally, using an event-driven
rule engine. However, from the viewpoint of temporal logic and synthesis, the
results can be unexpected, as we have seen. The N requirements in LTL have
the shape (∧ i : G(a(i) ⇒ X b(i))). The smallest model is one with a single
state, issuing all the b actions unconditionally. This is clearly unintended. The
intended model, which is compact, has 2^N states, one for each subset of the b
actions. Thus, the gap between the smallest non-compact and compact models
can be exponential in the length of the specification.
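The gap can be illustrated with a small Python sketch, under the assumption that inputs and outputs at each step are sets of indices (the triggered conditions and the issued actions, respectively) and that the requirement is checked on finite prefixes only. The helper names subsets, satisfies, unconditional and compact are hypothetical.

```python
from itertools import chain, combinations, product


def subsets(n):
    """All subsets of {0, ..., n-1}, as frozensets."""
    items = range(n)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(n + 1))]


def satisfies(inputs, outputs):
    """Check G(a(i) => X b(i)) for all i on a finite prefix: every condition
    raised at step k must be answered by its action at step k + 1."""
    return all(inputs[k] <= outputs[k + 1] for k in range(len(outputs) - 1))


def unconditional(n, inputs):
    """Single-state model: issue every action at every step."""
    return [frozenset(range(n)) for _ in inputs]


def compact(n, inputs):
    """Intended model: issue exactly the actions triggered one step earlier.
    Its state is the set of actions owed at the next step."""
    return [frozenset()] + [frozenset(a) for a in inputs[:-1]]


N = 3
words = list(product(subsets(N), repeat=3))  # all input prefixes of length 3
both_satisfy = all(
    satisfies(w, unconditional(N, w)) and satisfies(w, compact(N, w)) for w in words
)
compact_states = {compact(N, w)[1] for w in words}  # reachable "owed" sets
```

Both models satisfy the requirement on every prefix, but the compact one must remember which of the 2^N subsets of actions is owed, whereas the unconditional one needs no memory at all.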

We now show the main theorem that links compact and standard realizability 
through a specification transformation. 


Definition 2 (minimal language). For a language L over alphabet Σ and a
preference relation < on Σ^ω-words, the minimal elements of L form the language


min(L, <) = {x | x ∈ L ∧ ¬(∃y : y ∈ L ∧ y <+ x)}


I.e., a word x is in min(L, <) if it belongs to L and there is no word y in L that
is transitively better than x.
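The definition can be tried out on a finite stand-in, assuming the language is a finite set of words and the preference relation is given explicitly as a set of pairs (y, x) meaning "y is better than x". This is only a sketch: the actual definition ranges over infinite words, and the function names below are hypothetical.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing the given set of pairs."""
    closure = set(pairs)
    while True:
        new_pairs = {(x, z) for (x, y) in closure for (y2, z) in closure if y == y2}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def minimal_elements(language, better):
    """Words of the language with no transitively better word in the language."""
    better_plus = transitive_closure(better)
    return {x for x in language
            if not any((y, x) in better_plus for y in language)}


better = {("c", "d"), ("d", "e")}  # c is better than d, and d is better than e
example_1 = minimal_elements({"c", "e"}, better)
example_2 = minimal_elements({"d", "e"}, better)
```

In the first example, e is excluded because c is transitively better than e, even though the intermediate word d is not itself in the language.
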
Theorem 2. Language L is compactly realizable if and only if min(L, <) is
realizable. 


Proof. (Left-to-right) Let σ be a strategy that compactly realizes L. Consider
any input sequence a. The output b = σ(a) produced by the strategy is such that
there is no word in L that is better than (a,b), by the definition of compactness.
Hence, (a,b) is in min(L, <). As this holds for each input sequence, σ is a winning
strategy for min(L, <).

(Right-to-left) Let σ be a winning strategy for min(L, <). For any input
sequence a and its corresponding output b = σ(a), the word x = (a,b) must
satisfy min(L, <). By the definition of min, we have that (1) x also satisfies L.
Moreover, (2) there is no i/o word y that is better than x and also satisfies L.
From (1) and (2), σ is a compact strategy for L.


3.1 Effective Minimality Constructions for LTL 


Theorem 2 implies that one can reduce compact realizability to standard
realizability. Given a temporal specification φ, we transform its language L(φ) to the
language C(φ) = min(L(φ), <). Starting from an LTL formula f, we give two
constructions: one for the minimal language C(f), the other for its complement.
The constructions assume that the relation <+ can be expressed as an NBA,
which is the case for the preference order defined in the Introduction.

The first construction directly follows Definition 2. The left-hand term (x ∈
L(f)) is fulfilled by the standard conversion from LTL formula f to an NBA
A_f. For the right-hand term, we use the same NBA A_f, now re-defined over
y, for the y ∈ L(f) term; intersect this with the NBA for <+; then project
onto x and complement to obtain an NBA for the right-hand conjunct. The
intersection of these two NBAs provides an NBA for C(f). These steps may result
in a worst-case double exponential blowup in the size of f: the first exponential is
in the construction of A_f; the second is in the complementation step. A similar
construction applies if the specification is given directly as an NBA.

The second construction produces an NBA for the complement of the minimal
language, with “only” a worst-case single exponential blowup. The complement
of C(f) is (from the definition) {x | x ∉ L(f) ∨ (∃y : y ∈ L(f) ∧ y <+ x)}.
For an LTL formula f, one constructs NBAs A_f and A_¬f for the LTL formulas
f and ¬f, respectively. An NBA for (∃y : y ∈ L(f) ∧ y <+ x) is obtained as in
the first construction by omitting the final complementation step. The union of
this NBA with A_¬f gives an NBA for the complement of C(f).
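The two constructions can be summarized at the level of languages as follows. The names B(f) for the set of "dominated" words and π_x for projection onto the x component are introduced here only for this summary and are not notation from the text above.

```latex
\begin{align*}
  B(f) &= \pi_x\bigl(\{(x,y) \mid y \in L(f)\} \cap \{(x,y) \mid y <^{+} x\}\bigr)
        = \{x \mid \exists y : y \in L(f) \wedge y <^{+} x\},\\
  C(f) &= L(A_f) \cap \overline{B(f)},\\
  \overline{C(f)} &= L(A_{\neg f}) \cup B(f).
\end{align*}
```

Only the second line requires complementing an NBA obtained by projection, which is where the second exponential of the first construction arises.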

The NBA for the complement of C(f) can be used to model-check whether 
a given strategy is compact. It can also be used to synthesize machines using 
bounded synthesis, which requires an NBA for the complement of the specifi- 
cation property. The worst-case blowups are unavoidable: that follows from a 
lower-bound result by Birget [5] and a simpler but less general result of ours, 
discussed in the full version of this paper. 
3.2 Relationship to Quantitative Synthesis


The formulation of compactness is in terms of a qualitative notion of minimality. 
A natural question is the relationship to methods for synthesis with quanti- 
tative objectives; in particular, methods for producing programs with optimal 
worst-case or average-case behavior [6,13]. Expanding on the argument in the 
Introduction, we establish that worst-case optimality cannot always distinguish 
between compact and non-compact solutions to a given specification. 

In quantitative formulations, the synthesis game is formulated so that each 
transition has an associated reward. The reward of an infinite computation is 
defined using standard cumulative metrics such as mean-payoff (the limit of 
average rewards over successively longer prefixes) or discounted sum (the sum 
of rewards over the computation discounted geometrically, i.e., the k-th reward
is scaled by a factor d^k, where d ∈ (0,1) is the discount factor). The objective is
to find a winning strategy with maximum worst-case reward, where the worst- 
case reward is the minimum reward over all inputs. In the stochastic form of 
the game, an additional probabilistic player “Nature” is introduced, and the 
objective is to find a winning strategy with the maximum average-case reward, 
where the average is the expectation taken over the induced probability space. 
Precise definitions of these concepts can be found in [6]. 
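For orientation, one common way to write the two metrics for a reward sequence r = r_0, r_1, . . . is sketched below. This is an assumption about the formalization rather than a quotation from [6]: conventions differ (for instance, lim inf versus lim sup for mean-payoff, since the limit need not exist), and the symbols MP and DS_d are introduced here only for illustration.

```latex
\[
  \mathrm{MP}(r) \;=\; \liminf_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} r_k,
  \qquad
  \mathrm{DS}_d(r) \;=\; \sum_{k=0}^{\infty} d^{\,k}\, r_k, \quad d \in (0,1).
\]
```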


Worst-case optimality We return to the example discussed in the Introduction.
There, we had assumed for simplicity that each action set is assigned a cost that
is its cardinality. However, the reasoning carries over to any cost function that is
monotonic with respect to set inclusion: i.e., if A ⊆ B then cost(A) ≤ cost(B).
Intuitively, monotonicity captures the preference for choosing a smaller set of
output actions. Consider the mean-payoff cost of an infinite execution where
the input a is always true. For the non-compact program in Figure 1(i), it is
obvious that the limit of the average cost is cost({b}). That is also the case for
the compact program in Figure 1(ii): the fact that the initial cost is cost(∅) is
swamped in the limit. This is the worst-case input for both programs by the
monotonicity of the cost function. The best case for the program on the right is
when the input a is almost everywhere false. Thus, worst-case optimality cannot
distinguish between the two programs for any monotonic cost function.


Average-case optimality We now show that average-case optimality also cannot 
always distinguish between compact and non-compact strategies. The general 
principle is that if a strategy is non-compact only for a finite prefix of a compu- 
tation, its average-case cost in the limit will be the same as the cost of a strategy 
which performs in a compact manner throughout. 

Consider the input set I = {0,1}. Suppose that inputs are chosen uniformly
at random. The output set O is the set of subsets of the action set A = {a,b}.
Let the specification be the following: the initial choice of output set is either {a}
or A; all subsequent outputs must be A. There are only two winning strategies,
which differ only in their choice of initial output (either {a} or A); both produce
output A subsequently regardless of the input. Assuming unit cost per output
action, the average cost of a run of length n is thus (1 + 2(n − 1))/n for the first
strategy and 2 for the second. In the limit, both strategies have average cost 2,
although the first is compact, while the second is not. This argument also applies
for an arbitrary but monotone cost function.
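The arithmetic is easy to check numerically. The sketch below assumes unit cost per output action, as in the example; the function name average_cost is hypothetical.

```python
def average_cost(initial_cost: int, n: int, later_cost: int = 2) -> float:
    """Average cost of a run of length n: one initial output, then n - 1 outputs of A."""
    return (initial_cost + later_cost * (n - 1)) / n


lengths = (10, 1_000, 1_000_000)
compact_avgs = [average_cost(1, n) for n in lengths]      # initial output {a}
noncompact_avgs = [average_cost(2, n) for n in lengths]   # initial output A
```

The averages for the compact strategy approach 2 from below as n grows, while those for the non-compact strategy equal 2 for every n, so the two are indistinguishable in the limit.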

In our view, quantitative measures are best suited to modeling the real cost 
of actions rather than to modeling a preference ordering. The two may, however, 
be combined to good effect. As compactness is ensured with a specification trans- 
formation, one can modularly apply quantitative synthesis to the transformed 
specification min(L, <) to obtain strategies that are compact and also optimal 
with respect to a cost metric. 


3.3 Approximating Compactness 


The worst-case exponential blowups can make it difficult to produce compact 
strategies. Moreover, Theorem 1 asserts that there are specifications that are 
realizable but have no compact strategies. For both reasons, we describe methods 
by which one can approximate the compactness criterion. 


Approximately Minimal Languages The first method is to tighten the language
L to L' that lies between L and min(L, <), i.e., min(L, <) ⊆ L' ⊆ L; we call L'
approximately minimal for L. We synthesize a program satisfying L'. Given an NBA A for L
over alphabet I × O, we construct an NBA Â whose language is approximately
minimal for L. This construction applies only to a class of preference relations
that are induced pointwise by a partial order < on individual letters of the
output set O.

For infinite i/o words w = (a,b) and w' = (a',b'), define w <_P w' iff (1)
for all i, a_i = a'_i (inputs are identical) and b_i ≤ b'_i, and (2) there is some i for
which b_i < b'_i. We say that w ≤_P w' if w <_P w' or w = w'. The ordering <_P is
transitive and regular. It is easy to construct an automaton accepting <_P, which
checks condition (1) at each position of the zipped word of w and w', and accepts
only if condition (2) holds at some position on the zipped word. The subset and
cardinality preference relations introduced earlier are of this type.
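As a sanity check of the two conditions, the following sketch compares equal-length finite prefixes of i/o words, assuming a partial order leq on output letters is supplied by the caller. The restriction to finite prefixes and the function name strictly_better are assumptions made for this illustration.

```python
def strictly_better(w, w2, leq):
    """Pointwise strict order on equal-length finite i/o words.

    w and w2 are lists of (input, output) pairs; leq(b, b2) is the
    non-strict partial order on output letters.
    """
    if len(w) != len(w2):
        return False
    pairs = list(zip(w, w2))
    same_inputs = all(a == a2 for (a, _), (a2, _) in pairs)
    pointwise = all(leq(b, b2) for (_, b), (_, b2) in pairs)
    somewhere_strict = any(leq(b, b2) and b != b2 for (_, b), (_, b2) in pairs)
    return same_inputs and pointwise and somewhere_strict


subset = lambda x, y: x <= y  # subset order on frozensets of actions
empty, only_b = frozenset(), frozenset({"b"})
w = [(1, empty), (0, only_b)]
w2 = [(1, only_b), (0, only_b)]
```

Here w is strictly better than w2 under the subset order, since the inputs agree, every output of w is contained in the corresponding output of w2, and the first position is strictly smaller.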

Given an NBA A recognizing L, the NBA Â is constructed by excluding
certain transitions of A. Specifically, a transition (q, (a_1, b_1), q') of A is omitted
in Â if there is a “better” transition (q, (a_2, b_2), q') in A with a_1 = a_2 and b_2 < b_1.
Automaton Â can be efficiently constructed from A by performing a single pass
over the transition relation δ. The set of states, initial states and final states are identical in A and Â.
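A minimal sketch of this pruning step is given below, assuming transitions are stored as a set of triples (q, (a, b), q') and that the strict order on output letters is passed in as a predicate. The function name prune and the grouping by (q, a, q') are choices made for this illustration, not a description of the actual implementation.

```python
from collections import defaultdict


def prune(transitions, strictly_better):
    """Drop (q, (a, b1), q2) whenever some (q, (a, b2), q2) has b2 strictly better than b1."""
    outputs_by_edge = defaultdict(set)
    for q, (a, b), q2 in transitions:
        outputs_by_edge[(q, a, q2)].add(b)
    return {
        (q, (a, b), q2)
        for (q, a, q2), outs in outputs_by_edge.items()
        for b in outs
        if not any(strictly_better(other, b) for other in outs)
    }


empty, only_b = frozenset(), frozenset({"b"})
delta = {
    (0, (1, empty), 0),
    (0, (1, only_b), 0),   # dominated by the transition above
    (0, (0, only_b), 1),   # kept: no better transition with the same q, a, q'
}
pruned = prune(delta, lambda x, y: x < y)  # "<" is proper subset on frozensets
```

Grouping transitions by source, input and target means each transition is compared only against those it could be dominated by.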


Theorem 3. For a pointwise preference order < over O, L(Â) is an approximately
minimal language for L.


Proof. It is easy to see that L(Â) ⊆ L(A), as an accepting run in Â is also an
accepting run in A.

For the other inclusion, let w be in min(L, <_P). Then, w is also in L. Thus,
there is an accepting run ρ for w in A. If all transitions in ρ are present in Â,
then w is also in L(Â). If not, there is a transition (q, (a_1, b_1), q') at the k-th
step of ρ (for some k) that is not present in Â. By construction, there must
be a transition (q, (a_1, b_2), q') in A such that b_2 < b_1. Now consider the run ρ'
that is generated by swapping transition (q, (a_1, b_1), q') with (q, (a_1, b_2), q') at
the k-th step. This is also an accepting run, on a word w' that is identical to w
except that it has (a_1, b_2) rather than (a_1, b_1) as its k-th entry. As w' is in L and
w' <_P w, it cannot be the case that w is in min(L, <_P), a contradiction.


Minimal Strategies for L The second method searches greedily for compact
strategies in a game graph for L. For strategies σ and σ', say that σ ⊑ σ' (read as
“σ is better than σ'”) if for all input sequences a, σ(a) = σ'(a) or σ(a) <+ σ'(a).
I.e., the output on input a using σ is at least as good as that using σ'. The
minimal elements according to this ordering are called minimal strategies for L.
It is easy to show that every compact strategy for L is a minimal strategy for
L. The converse does not hold.

The greedy construction applies to a game graph for L where strategies are 
memoryless (e.g., if synthesis for L is a safety game or a parity game), and if the 
preference order is pointwise, as defined above. The core idea is simple: compute 
the set of winning positions; then nondeterministically and greedily extract a 
strategy by choosing only those transitions between successive winning positions 
that are output-minimal with respect to <. In the full version, it is shown that 
any strategy extracted in this manner is minimal for L. 
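The core idea can be sketched for a safety game, under several assumptions made only for this illustration: the game graph is a dictionary delta mapping (position, input, output) to the next position, the system picks its output before seeing the current input (matching the definition of a strategy in Section 2), and ties between minimal outputs are broken by list order. The function names are hypothetical.

```python
def winning_region(inputs, outputs, delta, safe):
    """Greatest fixpoint: positions from which some output keeps the play
    inside the region for every input (the output is chosen before the input)."""
    win = set(safe)
    while True:
        keep = {
            q for q in win
            if any(all(delta.get((q, a, b)) in win for a in inputs) for b in outputs)
        }
        if keep == win:
            return win
        win = keep


def greedy_strategy(inputs, outputs, delta, safe, strictly_better):
    """Pick, at each winning position, a winning output that is minimal w.r.t. the order."""
    win = winning_region(inputs, outputs, delta, safe)
    choice = {}
    for q in win:
        ok = [b for b in outputs
              if all(delta.get((q, a, b)) in win for a in inputs)]
        minimal = [b for b in ok if not any(strictly_better(c, b) for c in ok)]
        choice[q] = minimal[0]
    return win, choice


# Hypothetical game for G(a => X b): position 1 means "b is owed now".
empty, only_b = frozenset(), frozenset({"b"})
inputs, outputs = [0, 1], [only_b, empty]
delta = {}
for a in inputs:
    for b in outputs:
        delta[(0, a, b)] = a                            # nothing owed: any output is fine
        delta[(1, a, b)] = a if b == only_b else "bad"  # an owed b must be issued
        delta[("bad", a, b)] = "bad"
win, choice = greedy_strategy(inputs, outputs, delta, {0, 1}, lambda x, y: x < y)
```

In this small game the extracted strategy issues b only when it is owed, even though issuing b unconditionally would also be winning.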


4 Evaluation 


4.1 Multi-Robot Coordination 


Our original motivation to investigate compactness comes from an application to 
multi-robot orchestration. Due to space limitations, we describe this setting in 
brief. One has available multiple, heterogeneous robots, each capable of carrying 
out certain actions, some of which cannot be allowed to overlap. The goal is 
to perform specified tasks by (a) assigning robots to carry out actions and (b)
sequencing the actions appropriately. Tasks are described in a simple declarative
language, called Resh [12], that has been implemented and used to control groups 
of mobile robots. A useful subset of Resh is given by the following grammar, 
where A is the set of action names and R is a set of robot names. 


S ::= a → R | S ⇒ S | S & S | S | S | S + S


The interpretation of these operators is in terms of a finite-word input-output
sequence. A term a → R is interpreted as “perform action a using one of the
robots in R.” For this, a control strategy chooses a robot r in R, and produces a
“(begin a on r)” output event. Action duration is not fixed: e.g., the time taken
to perform a “move to position p” action may vary as the robot maneuvers
around humans. The completion signal is an “(end a on r)” event that is an input
to the control strategy. The other operators are interpreted as ⇒ (sequencing),
& (concurrent), | (choice), and + (concurrent with both tasks starting
together). The interpretation of each operator produces a regular language.

The finite-word semantics is appropriate for robotics tasks that must be 
performed to completion. The same observation motivates the use of LTLf (a 
finite-word variant of LTL) in [38] to specify robotics tasks. A winning control 
strategy is one that satisfies the semantics of the operators. 

As action completions are uncontrolled, even a simple specification such as
a → R is unrealizable if the completion signal is never issued by an adversarial
environment. It is thus necessary to restrict the environment so that every
initiated action is eventually completed. This assumption must be interpreted
over infinite words. It has the shape of a conjunction of LTL formulas
G(begin(a,r) ⇒ XF end(a,r)) over all actions a and robots r. This can be represented
by a DBA which tracks the set of pending (i.e., begun but not ended)
actions. This DBA is worst-case exponential in the number of action-robot pairs,
but in practice is limited by the concurrency in the specification.

In order to match the infinite-word environment constraint, the Resh system 
specification must be extended to infinite words. This is done by saying that an 
infinite word w satisfies the specification if there is a prefix x of w such that 
x satisfies the specification. Being a regular language, a Resh specification is 
representable as a DFA; this is extended to infinite words as a DBA by replacing 
the outgoing transitions of each final state with a self-loop on all inputs. 
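The DFA-to-DBA step can be sketched as follows, assuming a complete DFA given as a dictionary from (state, letter) to state. The function names and the small example automaton are hypothetical.

```python
def extend_to_infinite_words(alphabet, delta, finals):
    """Return the DBA transition map: every final state gets a self-loop on all letters."""
    dba = dict(delta)
    for q in finals:
        for letter in alphabet:
            dba[(q, letter)] = q
    return dba


def run(delta, start, word):
    """State reached after reading a finite word."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state


# Hypothetical DFA for the single finite word "ab"; state 2 is final, state 3 is a sink.
alphabet = "ab"
dfa = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 3,
    (3, "a"): 3, (3, "b"): 3,
}
dba = extend_to_infinite_words(alphabet, dfa, finals={2})
```

After the change, a run that reaches a final state stays there, so it visits a final state infinitely often exactly when some prefix of the infinite word is accepted by the original DFA.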

We have arrived at the final form of the synthesis question, which has the
shape E ⇒ S, where E (the environment assumptions) and S (the system
specification) are both representable as deterministic Büchi automata. That is
precisely the general form of a GR(1) specification [31]; therefore, algorithms for
GR(1) synthesis can be applied to synthesize finite-state controllers.


Implementation and Experiments. Our initial experiments in synthesis with
(E ⇒ S) occasionally produced non-compact strategies, which motivated this
exploration of compactness. We now use the modified specification E ⇒ Ŝ,
where the system portion Ŝ is made approximately compact through the
construction in Section 3.3, which preserves the GR(1) format. This specification
produces compact strategies for all cases we have examined.

Our implementation of GR(1) synthesis uses a SAT solver, similar to the 
method of [10]; we found this to be significantly faster than BDD-based meth- 
ods. As there is not a well-defined set of benchmarks for robotics or Resh specifi- 
cations, we generate 500 specifications at random, producing specifications with 
parse-tree depth 4, biased slightly to prefer the sequencing operation (i.e., ⇒)
over the others, as is likely to be the case in practice. 

The system specification is set up to have two robots. Actions are allowed 
to overlap, which implies that all specifications are realizable. Of the 500 speci- 
fications, the GR(1) game graph was generated for 428 (85%) within a timeout 
limit of 5 minutes for each specification. (The Resh-to-automaton construction 
uses BDDs to symbolically represent output event sets, which sometimes blows 
up.) All 428 game graphs are solved by the SAT-based GR(1) procedure within 
a timeout of 5 minutes per game. The median solution time is 3 seconds; 90%
are solved within 30 seconds; and all are solved within 225 seconds. We also 
experimented with a small hand-designed group of specifications where certain 
action overlaps are forbidden, which are also resolved efficiently. 


4.2 Compactness for LTL 


We now describe an implementation of a compact synthesis pipeline for general 
LTL specifications. Our experiments use the benchmarks from the SYNTCOMP 
(2020) competition.⁴ In these experiments, the preference order is fixed as the
pointwise subset order. We were forced to make this arbitrary choice as there 
is limited information about the origin of the benchmark problems, so we could 
not tailor the ordering to the problem domain. 

The goal is (1) to determine the difficulty of constructing a compact synthesis
pipeline for LTL, and (2) to gauge the practical feasibility of the compact synthesis
procedures. The experiments are designed to answer the following questions
that arise from (2): (Q1) What is the overhead of generating compact strategies
compared to standard synthesis? (Q2) Is the approximation procedure more efficient
than exact compactness? and (Q3) How effective are the approximate
constructions at producing compact strategies?


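[Figure 2: workflow diagram. Upper path: NBA for the approximate minimal language, then "determinize", then DPA, then "synthesize", yielding "Apx. compact strategy for f". Lower path: "determinize & complement", then DPA for min(L(f), <), then "synthesize", yielding "Compact strategy for f". Check: "Is M compact for f?" via modelcheck(M = Ref Model(f)).]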


Fig. 2. An overview of the workflow for our experiments and tool. In the figure, the
hatted min refers to the approximate minimal language, while Ref Model(f) refers to
the reference model for formula f.


A high-level overview of the internal structure of our tool is in Fig. 2. Our
implementation chains together several known tools: the automaton libraries
SPOT (v. 2.9.5) and Owl (v. 20.06) [25], the synthesis tool Strix (v. 20.10) [30]
and the model checker NuSMV (v. 2.6.0) [14]. We also use the AIGER toolkit [4]
as well as the Syfco synthesis format converter.⁵ We are grateful to the authors
for making these tools freely available.


4 At https://github.com/SYNTCOMP/benchmarks.
5 At https://github.com/reactive-systems/syfco
Our tool offers three main features: (1) Compact Realizability: given an
LTL formula f, determine if f is compactly realizable. This feature uses the 
compactness transformation from Section 3.1 to produce an automaton for the 
complement of the minimal language, which is then complemented, determinized 
and synthesized using Strix; (2) Compactness Test: given an LTL formula f 
and a candidate program P, determine if P is a compact program for f; and 
(3) Approximate Compact Realizability: given an LTL formula f, generate 
an approximately compact strategy. Here we implement the construction of the 
approximate minimal automaton from Section 3.3. 


Our experiments were carried out on a Linux VM running Ubuntu 20.04 with
12 GB of memory. Naturally, we only consider synthesizing compact strategies
for specifications that are realizable.⁶ The results can be summarized as follows:

(A1) We compare the efficiency of compact synthesis to that of standard synthesis
by evaluating the number of specifications that can be synthesized within a
certain time limit. We fix this time limit to be 10 minutes, and use Strix for
standard synthesis. (With this limit, the entire run over the benchmarks takes several
hours.) Strix determines realizability for 396 specifications out of 421 (≈ 94%),
while our tool determines compact realizability for 213 (≈ 50%).

(A2) Within the same time limit, the approximation technique determines
realizability for 398 specifications, significantly more than for exact compactness
and about the same as for standard realizability.

(A3) We model-check the strategies generated through approximate compact
realizability. Model-checking for compactness requires the complement minimal
automaton of a specification, so we set a time limit of 10 minutes per specification
to generate this automaton. Within this limit, our tool manages to construct the
required automaton for 246 specifications. Generating approximate compact
strategies for these 246 specifications, and applying the Compactness Test on these
strategies, we find that ≈ 42% of the synthesized strategies are compact.

In addition, we tried our tool on the generalized version of the example
specification from the introduction (∧ i : G(a(i) ⇒ X b(i))). Our tool can
synthesize a compact strategy up to N = 8 fairly quickly, after which our setup
struggles to compile the original LTL formula to an NBA. On the same set of
specifications, the approximate techniques also produce a compact strategy.

The implementation process was fairly straightforward, a pleasant surprise 
given the number of tools and format conversions involved. We had to patch 
some tools to extend their capabilities (e.g., to allow automata as specifications) 
and to implement format conversions. 

In summary, compact synthesis is feasible for a substantial number of spec- 
ifications. Where it is not — due either to blowups in automaton construction 
or due to the gap between normal and compact realizability — one can use the 
approximation procedure defined in Section 3.3 to generate strategies that are 
minimal with respect to the strategy ordering. 


6 We refer to the helpful classification of these benchmarks into realizable and
unrealizable ones from https://github.com/meyerphi/syntcomp-reference/.
5 Related Work
We discuss closely related work in synthesis and commonsense reasoning. 


Qualitative Temporal Synthesis There is a considerable literature on the synthe- 
sis of open reactive programs from LTL specifications, starting with the seminal 
work by Pnueli and Rosner [32]. The beautiful theoretical results are made prac- 
tical by the discovery of efficient algorithms for the GR(1) subclass [9,31], and 
procedures for bounded synthesis [21,36], based on so-called “Safraless” procedures [26].
These algorithms have been implemented in several tools, e.g., [11,15,16,19,23,33,27].
Our work builds on this basis by transforming the
search for compact strategies to a standard synthesis question that can be han- 
dled by these tools. 

In the robotics domain, prior work investigates synthesis for an interpreta- 
tion of LTL over finite words called LTLf [22,38,40]. Although Resh is similarly 
restricted to finite-word properties, a central difference is that specifications in 
LTLf (like LTL) are defined over propositions on robot and world state, and not 
in terms of actions of an unknown duration. 

There are many ways to choose between satisfying models: e.g., [7] designs 
synthesis procedures that produce minimally vacuous models. While the formu- 
lations differ, there is a common thread in the notion of minimality with respect 
to an ordering over models. 


Quantitative Temporal Synthesis A substantial body of work in temporal syn- 
thesis is focused on quantitative objectives. These problems are represented by 
games where each action has an associated cost (or, dually, reward) and the 
objective is to find strategies that minimize cost (or, maximize reward) (cf. [6]). 
There are several ways to formulate appropriate cost/reward functions and cor- 
respondingly many ways to solve such games. One could attempt to model com- 
pactness by assigning costs to actions such that if word x is better than word y 
then x has the lower cost. We chose not to develop solutions along such quan- 
titative lines for two main reasons: first, as the connection between cost and 
preference is indirect, setting up the right cost assignments to model a desired 
preference ordering is difficult; secondly, the theoretical complexity and practi- 
cal difficulty of quantitative synthesis is high. Instead, we chose to tackle the 
question in a qualitative manner. 

As shown in Section 3, quantitative measures cannot always differentiate 
between compact and non-compact solutions. Using the specification transfor- 
mation developed here, the two methods can, however, be used in cooperation: 
one can model the real costs of actions in a manner that is orthogonal to the 
preference ordering and compute minimal-cost, compact strategies. 

A recent work [1] focuses on the “quality” of satisfaction of an LTL formula 
(e.g., preferring to satisfy one part of a specification over another). Synthesis is 
through a reduction to a standard LTL specification; unfortunately this has a 
worst-case exponential blowup. 
Non-Monotonic Reasoning. As mentioned briefly in the introduction, the
compactness criterion is a form of commonsense reasoning: one does not expect
synthesized solutions to include unnecessary actions. Commonsense reasoning 
is exemplified by the classical frame problem, introduced in [29], which shows 
that the freedom of interpretation given by logic must be restricted in order to 
achieve commonsense conclusions. 

It was soon recognized that such restrictions imply a non-standard notion of
deduction, which is not monotonic: adding new hypotheses can invalidate current
conclusions [37]. In [28], McCarthy suggests a formulation in terms of a
circumscription operation: each inference is guarded with a “not(abnormal)” predicate,
and a successor state is one where the extent of this predicate is minimized, i.e.,
abnormal effects are maximally limited while avoiding inconsistencies. Logically,
this is specified in second-order logic as φ(A) ∧ ¬(∃B : B ⊂ A ∧ φ(B)), where φ
is the specification and A is the abnormality predicate. Readers will immediately
notice the similarity to the definition of min(L, <).

The importance of a general preference order in place of the fixed subset re- 
lation is laid out in [24]; the authors propose reasonable properties that any non- 
monotonic inference relation should meet, and show that a definition in terms of 
a preference ordering satisfies those properties. Our formulation of compactness 
is based on similar notions of minimality over a preference ordering on words. 
This is at the root of the non-monotonicity of compactness. These similarities 
hint at deeper connections between compactness and non-monotonic common- 
sense reasoning; we aim to investigate those in future work. 
Abstract. Motivated by applications in boolean-circuit design, boolean
synthesis is the process of synthesizing a boolean function with multi- 
ple outputs, given a relation between its inputs and outputs. Previous 
work has attempted to solve boolean functional synthesis by converting a 
specification formula into a Binary Decision Diagram (BDD) and quan- 
tifying existentially the output variables. We make use of the fact that 
the specification is usually given in the form of a Conjunctive Normal 
Form (CNF) formula, and we can perform resolution on a symbolic rep- 
resentation of a CNF formula in the form of a Zero-suppressed Binary 
Decision Diagram (ZDD). We adapt the realizability test to the context 
of CNF and ZDD, and show that the Cross operation defined in earlier 
work can be used for witness construction. Experiments show that our 
approach is complementary to BDD-based Boolean synthesis. 


Keywords: Boolean synthesis · Binary decision diagram · Zero-suppressed
binary decision diagram · Quantifier elimination · Resolution.


1 Introduction 


Boolean functions are widely used in electronic circuits, and thus in many as- 
pects of computing, to describe operations over binary values. Often the most 
natural way to express such an operation is as a declarative relation between in- 
puts and outputs. Implementing these operations in practice, however, requires 
a functional, rather than declarative, representation. The process of constructing 
a function that generates outputs directly from inputs, based on a given declar- 
ative relation between them, is called boolean synthesis. For example, boolean 
synthesis can be applied in constructing a full logical circuit from a relational 
specification [9,15] or an unknown intermediate component in an existing log- 
ical circuit [12]. Boolean synthesis is also useful for computing certificates for 
quantified boolean formulas (QBF), and advances in QBF solving and boolean 
synthesis are motivated by each other [3,20]. 

Formally, we are given a specification $f(\vec{x}, \vec{y})$, from
$\mathbb{B}^m \times \mathbb{B}^n$ to $\mathbb{B}$, relating two sets of boolean
variables. The specification holds true if and only if $\vec{y}$ is a correct
output for the inputs $\vec{x}$. We solve the synthesis problem following the
convention of splitting it into two sub-problems [9]:


1. Realizability: constructing the realizable set $R \subseteq \mathbb{B}^m$ of
input assignments $\vec{x}$ for which there exists an output assignment
$\vec{y}$ such that $f(\vec{x}, \vec{y}) = 1$.

2. Witness construction: constructing a witness function
$g : \mathbb{B}^m \to \mathbb{B}^n$ that computes an output
$\vec{y} = g(\vec{x})$ from an input $\vec{x} \in R$ such that
$f(\vec{x}, \vec{y}) = 1$.


Given a propositional formula f as the relational specification, we aim to syn- 
thesize a boolean function g that is correct by construction, meaning that as 
long as the input is realizable the output will satisfy the specification. 

Prior work solved the boolean functional synthesis by converting the specifi- 
cation formula into a Binary Decision Diagram (BDD), defined in Section 2, and 
quantifying the output variables existentially [9]. BDDs constitute a formalism 
for representing Boolean functions, supported by mature tools such as CUDD 
[22]. The size of a BDD representing a formula can, however, be exponential 
in the number of variables. Oftentimes, it is even not possible to construct the 
BDD before starting to solve the problem [9]. Noticing how this blow-up in BDD 
size has restricted the potential of existing BDD-based synthesis algorithms, we 
seek to develop an algorithm that reduces the impact of this exponential blowup. 
Hence we look for an alternative data structure that might be more promising 
in representing boolean formulas compactly. 

We identify here Zero-Suppressed Binary Decision Diagram (ZDD) [16], de- 
fined in Section 2, as such an alternative approach. ZDDs have been shown to 
sometimes outperform BDDs in the context of QBF solving [19]. Unlike BDDs, 
which represent a boolean formula semantically via the set of satisfying assign- 
ments, ZDDs are designed to encode sets of sets [14], allowing them to rep- 
resent syntactically a formula in Conjunctive Normal Form (CNF) as a set of 
clauses, which are themselves sets of literals. This means that it may require an 
exponential-size BDD to represent a CNF formula, which can be alternatively 
compactly encoded as a polynomial-size ZDD representation. 

It can be expected, however, that this more compact representation comes at 
a cost. Since ZDDs do not represent the solution sets directly like BDDs do, solv- 
ing realizability and synthesis over this representation might require additional 
effort. With this in mind, we perform here a full investigation comparing ZDDs
and BDDs for boolean synthesis. We focus on the following research questions: 


1. How do the sizes of the ZDD and BDD representations compare, and how 
does this affect the time of compiling the formulas into the diagram repre- 
sentation? Are ZDDs always more compact? 

2. In realizability, how do ZDDs vs. BDDs perform, in time and space?

3. How do ZDDs perform, compared to BDDs, in witness construction?

4. How does the end-to-end synthesis performance of ZDDs compare to BDDs?

5. For scalable families of formulas, how does the time and space performance
scale as the formula grows, comparing ZDDs to BDDs?


Our synthesis problem can often be expressed as boolean synthesis for CNF 
specifications, as the boolean specification in synthesis problems is often given 
in CNF form, and even non-CNF specifications can be easily converted to CNF.
Once specification formulas are given in CNF, it is possible to perform real- 
izability by using the resolution operation, which is equivalent to existentially 
quantifying the output variables directly from the CNF formula. Each resolu- 
tion step increases the number of clauses quadratically. But when a ZDD is 
used to represent the CNF formula, even when the number of clauses increases 
quadratically, the size of the ZDD tends to increase to a lesser extent. 

The crux of our contribution is a boolean-synthesis algorithm that performs 
resolution on a symbolic, ZDD-based representation of CNF formulas. To solve 
the first sub-problem of realizability, we compute the set $R \subseteq \mathbb{B}^m$ of all realizable
inputs, and then check the full and partial realizability of the input domain. The 
realizable set is generated by applying resolution to the ZDD representation of 
the CNF formula, based on operations defined in previous work [4,5,19]. 


The second sub-problem requires construction of a witness function
$g : \mathbb{B}^m \to \mathbb{B}^n$ for the output variables
$\vec{y} \in \mathbb{B}^n$. We adapt the formulas defined in previous work [9]
to the context of CNF, eliminating one output variable $y_i \in \mathbb{B}$ at a
time, and make use of the fact that resolution is equivalent to existential
quantification. In this way we can extract a witness
$g_i : \mathbb{B}^m \to \mathbb{B}$ for variable $y_i$ without abandoning the
ZDD representation.


After substituting the witness of an output variable back in the formula, we 
need to compute the next witness. This leads to our next challenge, which is how 
to guarantee that the formula remains in CNF after performing this substitution. 
The overall form of the entire formula after substitution is dependent on the
form of the substituted witness function $g_i$: clauses where $y_i$ is positive
can be converted back to CNF if $g_i$ is also in CNF, but clauses where $y_i$ is
negative require $g_i$ to be in Disjunctive Normal Form (DNF). Thus, what we
need are two equivalent witness functions, one in CNF and the other in DNF.

Our solution is to use the Cross ZDD operation, first defined by Knuth [14]. 
We show that if the Cross operation, defined on “families of sets” [14], is applied 
to a ZDD representation of a CNF formula, then the result can be interpreted 
as the ZDD for an equivalent DNF. In this way, with the Cross operation, we 
can use the CNF version of a witness for positive occurrences of a variable, and 
use the equivalent DNF version for negative occurrences, while both preserving 
the equivalence and ensuring that the resulting formula remains in CNF. 

Our experimental evaluation confirms the advantages of ZDDs in compila- 
tion, thanks to their linear size and direct correspondence to the CNF formula 
structure. As expected, this more compact representation can come with a trade- 
off of increasing the difficulty of constructing witnesses. Therefore, in synthesis 
performance, neither ZDDs nor BDDs dominate across the board, each per- 
forming better in different families of formulas. We therefore advocate for the 
ZDD-based approach as an addition to the portfolio of boolean synthesis tools, 
serving as a complement to BDD-based approaches [11]. 

As shown in related works on boolean synthesis, there exist alternative tools
including CegarSkolem [13], BFSS [1] and Manthan [10]. Our focus of comparison
here is, however, on improvements to decision-diagram-based tools for boolean
synthesis, rather than tools based, for example, on QBF solvers.
Decision-diagram-based approaches enjoy some unique advantages. For example,
decision diagrams facilitate partitioned-form representation [23]. Also, decision 
diagrams can be used as intermediate-step representation in temporal synthesis 
[24]. These unique advantages justify, we believe, our focus here on decision- 
diagram based approaches. We return to this point in our discussion of future 
work. 


2 Preliminaries 


Boolean Formulas and Functions. Boolean formulas and boolean functions are
built upon the boolean set $\mathbb{B} = \{0, 1\}$. We identify a boolean
formula $f(\vec{x})$ over $m$ propositional variables
$\vec{x} = (x_1, \ldots, x_m)$ with the boolean function
$f : \mathbb{B}^m \to \mathbb{B}$ such that $f(\vec{a}) = 1$ for an assignment
$\vec{a} = (a_1, \ldots, a_m) \in \mathbb{B}^m$ if and only if $\vec{a}$ is a
satisfying assignment to $\vec{x}$ in the formula. Two boolean formulas $f$ and
$f'$ are logically equivalent if they represent the same boolean function (and
therefore have the same set of satisfying assignments). Substitution of a
boolean expression $d(\vec{x})$ in place of a variable $x_i$ in a boolean
formula $f(\vec{x})$ is denoted by $f[x_i \mapsto d]$ and defined by
$f[x_i \mapsto d](\vec{x}) = f(x_1, \ldots, x_{i-1}, d(\vec{x}), x_{i+1}, \ldots, x_m)$.


Conjunctive and Disjunctive Normal Forms. A literal is either a variable or 
the negation of a variable. A clause is a disjunction of literals, and a cube is a 
conjunction of literals. A boolean formula in the form of a conjunction of clauses 
is said to be in Conjunctive Normal Form (CNF), and a boolean formula in the 
form of a disjunction of cubes is said to be in Disjunctive Normal Form (DNF). 


Definition 1 (Boolean Synthesis Problem). Given a boolean formula
$f(\vec{x}, \vec{y})$ in CNF with $m + n$ boolean variables, partitioned into
$m$ input variables $\vec{x} = (x_1, \ldots, x_m)$ and $n$ output variables
$\vec{y} = (y_1, \ldots, y_n)$, construct:


1. The set $R \subseteq \mathbb{B}^m$, called the realizability set, of all
assignments $\vec{a} \in \mathbb{B}^m$ to $\vec{x}$ for which there exists an
assignment $\vec{b} \in \mathbb{B}^n$ to $\vec{y}$ such that
$f(\vec{a}, \vec{b}) = 1$.

2. A function $g : \mathbb{B}^m \to \mathbb{B}^n$ such that
$f(\vec{a}, g(\vec{a})) = 1$ for all $\vec{a} \in R$. This is called a witness
function.

In practice, arbitrary formulas can be converted to equi-realizable CNF
formulas with a linear blowup using Tseytin encoding, quantifying existentially
over Tseytin variables. The witnesses for the equi-realizable formula can then
be used for the original formula.
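To make Definition 1 concrete, the following is a minimal brute-force sketch, not the ZDD-based method of this paper. It assumes a DIMACS-style encoding chosen here for illustration (a literal is a nonzero integer, a clause is a set of literals); the function names and the example specification are hypothetical.

```python
from itertools import product

def evaluate(cnf, assignment):
    """True iff every clause contains a literal satisfied by `assignment` (var -> bool)."""
    return all(any(assignment[abs(l)] == (l > 0) for l in clause) for clause in cnf)

def realizable_set_and_witness(cnf, inputs, outputs):
    """Enumerate all input assignments and record the first satisfying output for each."""
    witness = {}
    for a in product([False, True], repeat=len(inputs)):
        for b in product([False, True], repeat=len(outputs)):
            assignment = {**dict(zip(inputs, a)), **dict(zip(outputs, b))}
            if evaluate(cnf, assignment):
                witness[a] = b  # one possible value of g(a)
                break
    return set(witness), witness

# Hypothetical specification (x1 or y) and (not x1 or not y) and (x2 or not y),
# with inputs x1 = 1, x2 = 2 and output y = 3.
spec = [{1, 3}, {-1, -3}, {2, -3}]
R, g = realizable_set_and_witness(spec, inputs=[1, 2], outputs=[3])
```

Here the input (x1, x2) = (0, 0) is not in R, since clause (x1 or y) forces y = 1 while clause (x2 or not y) forces y = 0. The enumeration is exponential in m + n, which is exactly what the symbolic approach aims to avoid.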


Binary Decision Diagrams. A (Reduced Ordered) Binary Decision Diagram 
(BDD) [2] is a directed acyclic graph that represents a boolean function. In- 
ternal nodes of the BDD represent boolean variables, and paths on the BDD 
correspond to assignments, leading either to a terminal node 1 if satisfying or 
0 if unsatisfying. We assume that all BDDs are ordered, meaning that variables 
are ordered in the same way along every path, and reduced, meaning that su- 
perfluous nodes are removed and identical subgraphs are merged. Given these 
two conditions, BDDs are a canonical representation, meaning that two BDDs 
with the same variable order that represent the same function will be identical
[2]. The variable order used can have a major impact on the BDD’s size, and
two BDDs representing the same function but with different orders can have an 
exponential difference in size. 


Zero-Suppressed Decision Diagrams. A Zero-Suppressed Binary Decision Diagram
(ZDD) is a data structure first defined in [16]. ZDDs are similar to BDDs
but use a different reduction rule: while BDDs remove nodes where both edges
point to the same child, ZDDs remove nodes where the 1-edge (edge assigning the
variable to 1) points directly to the 0-terminal. Specifically, under the CNF
encoding described below, the 0-ZDD (the empty set of clauses) encodes formulas
that are always valid, and the 1-ZDD (a single empty clause) encodes
contradiction.


Semantics on Families of Sets. ZDDs can be used to implicitly represent families
of subsets of a set $S$, where the variables in the ZDD correspond to elements
of $S$ that can be either present or absent in a subset [14]. For a ZDD $Z$, we
denote by $[\![Z]\!]$ the family of subsets represented by $Z$. We define
$[\![0]\!] = \emptyset$ and $[\![1]\!] = \{\emptyset\}$ for the terminals 0 and
1, respectively. Using $Z(x, Z_0, Z_1)$ to denote a ZDD with variable $x$ as the
root, ZDD $Z_0$ as the 0-child and ZDD $Z_1$ as the 1-child, we define
$[\![Z(x, Z_0, Z_1)]\!] = [\![Z_0]\!] \cup \{\{x\} \cup \alpha \mid \alpha \in [\![Z_1]\!]\}$.
Note that using this interpretation every subset in the family corresponds to a
path to the terminal 1 on the ZDD. Since CNF formulas can be viewed as sets of
clauses, where a clause can be viewed as a set of literals, we can use ZDDs to
represent CNF formulas syntactically. When representing a formula in CNF by a
ZDD, for each atomic proposition $p$ we treat its positive and negative literals
$p$ and $(\neg p)$ as two distinct variables $x_p$ and $x_{\neg p}$. Then every
path leading to the 1-terminal corresponds to a clause in the CNF formula, where
$x_\ell$ connects to its 1-edge in the path if and only if the literal $\ell$ is
in the corresponding clause.
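The family-of-sets semantics can be illustrated with a small sketch. It assumes a toy encoding chosen only for illustration (a ZDD is the integer 0, the integer 1, or a tuple `(var, lo, hi)`); this is not how CUDD stores nodes, and the literal names are hypothetical.

```python
def family(z):
    """Return the family of sets denoted by a toy ZDD: 0, 1, or (var, lo, hi)."""
    if z == 0:
        return set()              # 0-terminal: the empty family
    if z == 1:
        return {frozenset()}      # 1-terminal: the family containing only the empty set
    var, lo, hi = z
    # Sets without `var` come from the 0-child; sets with `var` from the 1-child.
    return family(lo) | {frozenset({var}) | s for s in family(hi)}

# CNF (p or q) and (not p), with literals as distinct ZDD variables
# ordered "p", "~p", "q". Each path to the 1-terminal is one clause.
z = ("p", ("~p", 0, 1), ("q", 0, 1))
clauses = family(z)
```

In this example `clauses` contains exactly the two sets {"~p"} and {"p", "q"}, matching the two clauses of the formula.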


ZDD Operations. We use standard ZDD operations such as Subset0, Subset1,
Change, Union, Intersect, and Difference, defined previously in [17] and
implemented in the CUDD package [22]. In terms of families of sets,
Subset0($Z$, $x$) returns the family of all sets $\alpha$ such that
$\alpha \in [\![Z]\!]$ and $x \notin \alpha$, and Subset1($Z$, $x$) returns the
family of all sets $\alpha \setminus \{x\}$ such that $\alpha \in [\![Z]\!]$ and
$x \in \alpha$. Change($Z$, $x$) returns the family
$\{\alpha \cup \{x\} \mid \alpha \in [\![Z]\!] \text{ and } x \notin \alpha\} \cup \{\alpha \setminus \{x\} \mid \alpha \in [\![Z]\!] \text{ and } x \in \alpha\}$.
The operation Resolution($x$, $Z$) returns the ZDD representing the result of
applying resolution to variable $x$ in the CNF represented by $Z$. It is
implemented following [4], using the operations SubsumptionFreeUnion, which
takes the union of two families of sets while removing subsumed sets, and
ClauseDistribution, which returns the family of sets resulting from applying
distribution over two given sets of clauses. The witness-construction phase also
requires the Cross operation defined in [14] to convert between CNF and DNF
representations. See Section 4 for details.
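The set-level meaning of Subset0, Subset1 and Change can be sketched directly on explicit families of sets. This is only an illustration of the semantics stated above, assuming families are Python sets of frozensets; the actual operations work on the shared ZDD graph and never enumerate the family.

```python
def subset0(fam, x):
    """Sets of the family that do not contain x."""
    return {s for s in fam if x not in s}

def subset1(fam, x):
    """Sets of the family that contain x, with x removed."""
    return {s - {x} for s in fam if x in s}

def change(fam, x):
    """Toggle membership of x in every set of the family."""
    return {s ^ {x} for s in fam}

# Hypothetical family: clauses (p or q), (not p), (q or r).
fam = {frozenset({"p", "q"}), frozenset({"~p"}), frozenset({"q", "r"})}
without_p = subset0(fam, "p")
with_p = subset1(fam, "p")
toggled_q = change(fam, "q")
```

For this family, `with_p` keeps only the remainder {"q"} of the clause containing p, while `toggled_q` removes q where it was present and adds it to the clause (not p).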
3 Realizability Using ZDDs


We describe in this work a ZDD-based algorithm to solve the Boolean-Synthesis 
Problem described in Definition 1. This means that the specification $f(\vec{x}, \vec{y})$,
the realizability set R and the witness function g are all represented by CNF 
formulas encoded as ZDDs, as defined in Section 2. In this section we describe 
how to compute the realizability set R and analyze it to answer whether the 
specification is partially or fully realizable. In Section 4 we describe how to 
compute the witness function g. 


3.1 Realizable Set R 


In order to construct the set $R$ of realizable assignments to the input
variables $\vec{x}$, as described in Definition 1, we need to quantify
existentially the output variables $\vec{y}$, analogously to the BDD-based
approach of [9].

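Let $f_0, \ldots, f_n$ be CNF formulas such that


$f_n = f$
$f_{n-1} = (\exists y_n) f$
$\vdots$
$f_i = (\exists y_{i+1}) \ldots (\exists y_{n-1}) (\exists y_n) f$
$\vdots$
$f_1 = (\exists y_2) \ldots (\exists y_{n-1}) (\exists y_n) f$
$f_0 = (\exists y_1) \ldots (\exists y_{n-1}) (\exists y_n) f$


As in [9], the last formula
$f_0 = (\exists y_1) \ldots (\exists y_{n-1}) (\exists y_n) f$ implicitly
represents the realizable set $R$, describing the set of satisfying assignments
of $f_0$.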

To compute $f_0, \ldots, f_n$ as CNF formulas, we apply the resolution operation,
which is equivalent to existential quantification [7]. We first state a normal-form 
lemma. 


Lemma 1. [4] Let $f$ be a CNF formula. Let $f_p^+$ denote the conjunction of all
clauses $\alpha$ such that $(p \vee \alpha)$ is a clause in $f$. Let $f_p^-$
denote the conjunction of all clauses $\beta$ such that $((\neg p) \vee \beta)$
is a clause in $f$. Let $f_p'$ denote the conjunction of clauses $\gamma$ in $f$
where neither $p$ nor $(\neg p)$ is a literal in $\gamma$. Then $f$ is logically
equivalent to $(p \vee f_p^+) \wedge (\neg p \vee f_p^-) \wedge f_p'$ for a
boolean variable $p$.


Proof. The claim follows from [4]. 


Next we show how to use resolution to existentially quantify a variable from 
a formula in the normal form of Lemma 1. 


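Lemma 2. Let $y$ be a boolean variable, then the boolean formula
$(\exists y)((y \vee f_y^+) \wedge (\neg y \vee f_y^-) \wedge f_y')$ is
logically equivalent to $(f_y^+ \vee f_y^-) \wedge f_y'$.


Proof.

$(\exists y)((y \vee f_y^+) \wedge (\neg y \vee f_y^-) \wedge f_y')$

$\equiv (\exists y)(((y \wedge \neg y) \vee (y \wedge f_y^-) \vee (f_y^+ \wedge \neg y) \vee (f_y^+ \wedge f_y^-)) \wedge f_y')$

$\equiv ((\exists y)(y \wedge f_y^-) \vee (\exists y)(\neg y \wedge f_y^+) \vee (\exists y)(f_y^+ \wedge f_y^-)) \wedge f_y'$ ($f_y'$ excludes $y$)

$\equiv ((\exists y)(y \wedge f_y^-) \vee (\exists y)(\neg y \wedge f_y^+) \vee (f_y^+ \wedge f_y^-)) \wedge f_y'$ ($f_y^+$, $f_y^-$ exclude $y$)

$\equiv (f_y^- \vee f_y^+ \vee (f_y^+ \wedge f_y^-)) \wedge f_y'$ ($f_y^+$, $f_y^-$ exclude $y$)

$\equiv (f_y^+ \vee f_y^-) \wedge f_y'$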


We call the formula $(f_y^+ \vee f_y^-) \wedge f_y'$ the resolution of the
variable $y$ in $f$. Note that this formula (specifically the subformula
$(f_y^+ \vee f_y^-)$) is not in CNF, but can be easily rewritten in CNF by
distributing the clauses in $f_y^+$ over the clauses in $f_y^-$. The equivalence
of resolution and existential quantification then follows from Lemmas 1 and 2
above:


Corollary 1. For a formula $f$ and a boolean variable $y$, the formula
$(\exists y) f$ is logically equivalent to $(f_y^+ \vee f_y^-) \wedge f_y'$.


Proof. The claim follows from Lemmas 1 and 2. 


We represent $f_y^+$, $f_y^-$, $f_y'$ by ZDDs by applying the Subset0 and
Subset1 operations described in Section 2: $Z_y^+ = \mathit{Subset1}(Z, y)$,
$Z_y^- = \mathit{Subset1}(Z, \neg y)$, and
$Z_y' = \mathit{Subset0}(\mathit{Subset0}(Z, y), \neg y)$. We then use the
ClauseDistribution operation to distribute the clauses of $Z_y^+$ over $Z_y^-$,
and the SubsumptionFreeUnion operation to combine all clauses into a single ZDD.
This implements the operation Resolution($y_i$, $Z$) mentioned in Section 2. In
practice, we follow the Cut-Elimination Algorithm of [4], which also eliminates
tautologies by removing clauses where the same variable appears both positively
and negatively. Therefore we can assume that the ZDD representations of
$f_0, \ldots, f_n$ do not include subsumed or tautological clauses, which may
also lead to smaller ZDDs.

The advantage of applying resolution symbolically over a ZDD representation,
rather than directly over the CNF formula, stems from the fact that every
resolution step can increase the number of clauses in the formula
quadratically. Thus, the number of clauses after multiple resolution steps can
easily grow exponentially. ZDDs, compared to representing clauses explicitly,
are well-equipped for representing large sets of clauses compactly, often being
able to represent an exponential set of clauses in polynomial space [16]. The
ZDD representation also makes it easy to remove subsumed and tautological
clauses, further reducing size.
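The resolution step of Corollary 1 can be sketched on an explicit clause set. This is a simplified illustration under assumptions made here (literals as nonzero integers, clauses as frozensets, input clauses free of tautologies); it enumerates clauses explicitly, whereas the algorithm in this section performs the same steps on a ZDD.

```python
from itertools import product

def resolve(cnf, y):
    """Eliminate variable y from a CNF given as a set of frozensets of literals."""
    pos = {c - {y} for c in cnf if y in c}                   # f_y^+
    neg = {c - {-y} for c in cnf if -y in c}                 # f_y^-
    rest = {c for c in cnf if y not in c and -y not in c}    # f_y'
    resolvents = {a | b for a in pos for b in neg}           # distribute f_y^+ over f_y^-
    clauses = {c for c in resolvents | rest
               if not any(-l in c for l in c)}               # drop tautological clauses
    return {c for c in clauses
            if not any(d < c for d in clauses)}              # drop subsumed clauses

def holds(cnf, assignment):
    """True iff `assignment` (var -> bool) satisfies every clause."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in cnf)

# Hypothetical example: inputs x1 = 1, x2 = 2 and output y = 3.
spec = {frozenset({1, 3}), frozenset({-1, -3}), frozenset({2, -3})}
f0 = resolve(spec, 3)   # the single clause (x1 or x2)
```

In this example the resolvent (x1 or not x1) is discarded as a tautology, leaving (x1 or x2) as the CNF of the realizable set.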


3.2 Full and Partial Realizability 


When the realizable set $R$ is represented by a BDD, as in [9], it is easy to
check whether $R = \emptyset$ or $R = \mathbb{B}^m$, as this corresponds to the
BDD being equal to 0 or 1, respectively. This is less straightforward for a ZDD
representation, which expresses $R$ indirectly by the set of clauses in its CNF
representation $f_0$, rather than the set of assignments itself. We say that a
CNF specification $f(\vec{x}, \vec{y})$ is fully realizable if and only if all
$\vec{a} \in \mathbb{B}^m$ have some $\vec{b} \in \mathbb{B}^n$ so that
$f(\vec{a}, \vec{b})$ holds. This corresponds to $R = \mathbb{B}^m$. Similarly,
we say that $f$ is partially realizable if and only if there is some $\vec{a}$
for which there exists some $\vec{b}$ so that $f(\vec{a}, \vec{b})$ holds. This
corresponds to $R \neq \emptyset$. After computing a ZDD representation of $R$,
we wish to check full and partial realizability over this representation.


Theorem 1. The CNF specification $f(\vec{x}, \vec{y})$ is fully realizable if
and only if the ZDD for $f_0$ is equivalent to the 0-ZDD.


Proof. The specification $f(\vec{x}, \vec{y})$ is fully realizable if and only
if the CNF formula $f_0$ representing $R$ is a tautology, which means that every
clause of $f_0$ has both $p$ and $\neg p$ for some variable $p$, i.e., every
clause is a tautology. Tautologies, however, are automatically removed by the
Resolution operation, as explained in Section 3.1. Thus, full realizability
occurs if and only if the set of clauses is empty, represented by the ZDD 0.


Note that the realizability set $R$ is represented by the CNF formula
$f_0 = (\exists y_1) \ldots (\exists y_n) f$, which does not contain any free
occurrences of $\vec{y}$ variables. We then perform resolution on the $\vec{x}$
variables in the same way as we did for the $\vec{y}$ variables. Then the
original formula is partially realizable if and only if
$(\exists x_1)(\exists x_2) \ldots (\exists x_m) f_0$ is true, meaning that
resolution does not derive a contradiction. If a contradiction is derived, the
resulting ZDD is the terminal 1, representing the empty clause. Otherwise it is
the terminal 0.


Theorem 2. The CNF specification $f(\vec{x}, \vec{y})$ is partially realizable
if and only if the ZDD representing
$(\exists x_1)(\exists x_2) \ldots (\exists x_m) f_0$ is equivalent to the
0-ZDD.


Proof. Since all variables are existentially quantified, the ZDD must be either
the terminal 0 (representing the empty CNF, equivalent to true) or the terminal
1 (representing a CNF with an empty clause, equivalent to false). In the first
case, the formula
$(\exists x_1) \ldots (\exists x_m)(\exists y_1) \ldots (\exists y_n) f$ is
true, meaning that there is an assignment that satisfies $f(\vec{x}, \vec{y})$,
which by definition makes $f$ partially realizable. In the second case, the
formula is false, meaning that there is no such assignment, and therefore $f$ is
not partially realizable.
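Theorems 1 and 2 can be checked on small examples with an explicit clause-set sketch. The code below is an illustration under assumptions made here (literals as nonzero integers, explicit sets instead of ZDDs, hypothetical function names): the empty clause set plays the role of the 0-ZDD, and the set containing only the empty clause plays the role of the 1-ZDD.

```python
def resolve(cnf, y):
    """Resolution on variable y with tautology and subsumption removal."""
    pos = {c - {y} for c in cnf if y in c}
    neg = {c - {-y} for c in cnf if -y in c}
    rest = {c for c in cnf if y not in c and -y not in c}
    clauses = {c for c in {a | b for a in pos for b in neg} | rest
               if not any(-l in c for l in c)}
    return {c for c in clauses if not any(d < c for d in clauses)}

def eliminate(cnf, variables):
    """Existentially quantify the given variables one at a time."""
    for v in variables:
        cnf = resolve(cnf, v)
    return cnf

def fully_realizable(cnf, outputs):
    # Theorem 1: f_0 has no clauses left (the 0-ZDD).
    return eliminate(cnf, outputs) == set()

def partially_realizable(cnf, inputs, outputs):
    # Theorem 2: eliminating every variable does not derive the empty clause.
    return eliminate(cnf, list(outputs) + list(inputs)) == set()

def cnf(*clauses):
    return {frozenset(c) for c in clauses}

# Hypothetical specifications with input x1 = 1 (and x2 = 2) and output y = 3.
xor_spec = cnf({1, 3}, {-1, -3})              # y = not x1: fully realizable
partial_spec = cnf({1, 3}, {-1, -3}, {2, -3}) # fails only for x1 = x2 = 0
empty_spec = cnf({3}, {-3})                   # y and not y: unrealizable
```

The third specification resolves to the single empty clause, so both checks fail for it, while the second one is partially but not fully realizable.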


4 Synthesis Using ZDDs 


As described in [9], once we have computed the formulas $f_0, \ldots, f_n$ with
the output variables existentially quantified, we can construct the witness
$g_i$ for variable $y_i$ from the formula
$f_i[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}]$, after having computed
the witnesses $g_1, \ldots, g_{i-1}$ for the preceding variables. In [9], two
witness functions were presented for variable $y_i$: the default-1 witness
$f_i[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}][y_i \mapsto 1]$ and the
default-0 witness
$(\neg f_i)[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}][y_i \mapsto 0]$.
In this work, however, we additionally want to ensure that we maintain the CNF
form of the specification after substituting $g_1, \ldots, g_{i-1}$ into $f_i$,
to enable ZDD representation.


72 Y. Lin et al. 


In this section we show how to construct and substitute witnesses so that the 
result remains in CNF. 

For ZDD-based algorithms, the iterated-substitution approach requires more
sophistication for the construction of the witnesses, compared to the
iterated-substitution approach for BDDs. We solve this problem in Section 4.2.
As in [9], the resulting witnesses guarantee that
$f(\vec{a}, g_1(\vec{a}), \ldots, g_n(\vec{a})) = 1$ for all realizable input
assignments $\vec{a} \in R$.


4.1 Witnesses for Single-Dimension Output Variable 


As in [9], we start by defining witnesses for the case when there is a single output 
variable: 


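Lemma 3. Let $f$ be a CNF formula over boolean variables $x_1, \ldots, x_m, y$.
Then the formulas $f_y^-$ and $\neg f_y^+$ are witnesses for the variable $y$.


Proof. The realizability set, as defined in Section 3.1, is
$R = \{\vec{a} \in \mathbb{B}^m \mid (\exists y) f[\vec{x} \mapsto \vec{a}] = 1\}$.
Thus, by Corollary 1, for all $\vec{a} \in R$


$((f_y^+ \vee f_y^-) \wedge f_y')[\vec{x} \mapsto \vec{a}] = 1.$ (1)


Hence $f_y'[\vec{x} \mapsto \vec{a}] = 1$ and either
$f_y^+[\vec{x} \mapsto \vec{a}] = 1$ or $f_y^-[\vec{x} \mapsto \vec{a}] = 1$.

Now we want to show $f(\vec{a}, g(\vec{a})) = 1$, i.e.,
$f[y \mapsto g(\vec{x})][\vec{x} \mapsto \vec{a}] = 1$, for both
$g(\vec{x}) = f_y^-$ and $g(\vec{x}) = \neg f_y^+$.

For $g(\vec{x}) = f_y^-$, since
$f = f_y' \wedge (y \vee f_y^+) \wedge ((\neg y) \vee f_y^-)$, we are left to
show
$f[y \mapsto f_y^-][\vec{x} \mapsto \vec{a}] = (f_y' \wedge (f_y^- \vee f_y^+) \wedge ((\neg f_y^-) \vee f_y^-))[\vec{x} \mapsto \vec{a}] = 1$.
By (1) we are only left to show
$((\neg f_y^-) \vee f_y^-)[\vec{x} \mapsto \vec{a}] = 1$, which follows from the
left-hand side being a tautology.

Similarly, for $g(\vec{x}) = \neg f_y^+$, we need to show
$f(\vec{a}, g(\vec{a})) = f[y \mapsto (\neg f_y^+)][\vec{x} \mapsto \vec{a}] = 1$.
This is equivalent to showing that
$(f_y' \wedge ((\neg f_y^+) \vee f_y^+) \wedge (f_y^+ \vee f_y^-))[\vec{x} \mapsto \vec{a}] = 1$.
By (1) we are only left to show
$((\neg f_y^+) \vee f_y^+)[\vec{x} \mapsto \vec{a}] = 1$, a tautology.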


Note that the witness $f_y^-$ is in CNF, while the witness $\neg f_y^+$, being
the negation of a CNF formula, can be more easily represented in DNF. Note also
that these witnesses do not correspond exactly to the default-1 and default-0
witnesses of [9], which would more specifically be equivalent to
$f_y^- \wedge f_y'$ and $\neg(f_y^+ \wedge f_y')$, respectively. We choose the
alternative witnesses because they contain fewer clauses, and thus are more
likely to produce a more efficient ZDD representation.
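Lemma 3 can be sanity-checked by brute force on a small specification. The sketch below assumes an explicit clause-list encoding (literals as nonzero integers) and hypothetical helper names; it evaluates both candidate witnesses, $f_y^-$ and $\neg f_y^+$, on every realizable input.

```python
from itertools import product

def holds(cnf, assignment):
    """True iff `assignment` (var -> bool) satisfies every clause."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in cnf)

def split(cnf, y):
    """Return the clause lists for f_y^+ and f_y^- (y and not-y removed)."""
    pos = [c - {y} for c in cnf if y in c]
    neg = [c - {-y} for c in cnf if -y in c]
    return pos, neg

def check_witnesses(cnf, inputs, y):
    """Check that f_y^- and not f_y^+ both satisfy cnf on every realizable input."""
    pos, neg = split(cnf, y)
    for values in product([False, True], repeat=len(inputs)):
        a = dict(zip(inputs, values))
        if any(holds(cnf, {**a, y: b}) for b in (False, True)):  # a is in R
            g_neg = holds(neg, a)          # value of the witness f_y^-
            g_pos = not holds(pos, a)      # value of the witness not f_y^+
            if not (holds(cnf, {**a, y: g_neg}) and holds(cnf, {**a, y: g_pos})):
                return False
    return True

# Hypothetical specification with inputs 1, 2 and output 3.
spec = [frozenset({1, 3}), frozenset({-1, -3}), frozenset({2, -3})]
ok = check_witnesses(spec, inputs=[1, 2], y=3)
```

Inputs outside R are skipped, since a witness is only required to be correct on realizable inputs.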


4.2 Preserve CNF by Equivalent Witnesses 


We now explain how to construct witnesses of multiple output variables. Let
$f_n, \ldots, f_0$ be as defined in Section 3.1. We can then compute a witness
for each $y_i$ iteratively, as in [9]. Using the $f_y^-$ witness from Lemma 3,
for example, this means
$g_i(\vec{x}) = (f_i[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}])_{y_i}^-$,
where $g_i$ is the witness for variable $y_i$.

The substitution $f_i[y \mapsto g]$, however, is not necessarily in CNF. But
Lemma 3 requires that the formula is in CNF in order to extract the next
witness. This means that we need to find a way to perform the substitution in a
way that the result remains in CNF.

Recall that, since our Resolution operation removes tautological clauses, each
variable can only occur in positive or negative form in a clause, but not both.
If the witness $g$ is in CNF, e.g., $g = f_y^-$, we can substitute this witness
in a clause $(y \vee l_1 \vee l_2 \vee \ldots)$ where $y$ occurs in positive
form. The result is a disjunction of the literals $l_1, l_2, \ldots$ and the CNF
$g = (cl_1 \wedge cl_2 \wedge \ldots)$. By distribution, we can write this as an
equivalent CNF
$((cl_1 \vee l_1 \vee l_2 \vee \ldots) \wedge (cl_2 \vee l_1 \vee l_2 \vee \ldots) \wedge \ldots)$.
Likewise, if the witness $g$ for $y$ is in DNF, e.g., $g = \neg f_y^+$ (writing
$g = \neg(cl_1 \wedge cl_2 \wedge \ldots)$), then, after the substitution, every
clause $(\neg y \vee l_1 \vee l_2 \vee \ldots)$ where $y$ appears in negative
form can be converted to the CNF
$(\neg(\neg(cl_1 \wedge cl_2 \wedge \ldots)) \vee l_1 \vee l_2 \vee \ldots) \equiv ((cl_1 \wedge cl_2 \wedge \ldots) \vee l_1 \vee l_2 \vee \ldots) \equiv ((cl_1 \vee l_1 \vee l_2 \vee \ldots) \wedge (cl_2 \vee l_1 \vee l_2 \vee \ldots) \wedge \ldots)$.

The problem, therefore, is that if we want the result to be in CNF, CNF 
witnesses work well for positive occurrences, while DNF witnesses work well 
for negative occurrences. Thus, as long as we can find an efficient conversion 
between CNF formulas and their equivalent DNF formulas, we can ensure that 
the substitution formula $f_i[y \mapsto g]$ can be written as a CNF. For this purpose,
we introduce the Cross operator from [14]. 


Definition 2 (Cross operation).
Let $S$ be a family of sets of literals. Then


$\mathit{Cross}(S) = \mathit{Minimal}\{t \mid \forall s_i \in S : t \cap s_i \neq \emptyset\}$,


where
$\mathit{Minimal}(S) = \{t \in S \mid \forall s \in S : s \subseteq t \Rightarrow s = t\}$.


Hence, Cross($S$) is a family of sets of literals, such that every set $t$ of
literals in Cross($S$) has at least one literal in common with every set of
literals in $S$. Moreover, every set $t$ in Cross($S$) is irredundant [14],
meaning that no proper subset of $t$ satisfies this property.

Specifically, if $S$ represents a given CNF $f$, where every set $s_i \in S$
represents a clause and the elements of $s_i$ are the literals in that clause,
then Cross($S$) represents the set of smallest possible sets $t$ such that $t$
has at least one common literal with every disjunctive clause of $f$.
Equivalently speaking, Cross($S$) collects all $t$ such that every disjunctive
clause is satisfied, i.e., it is a collection of all irredundant sets of
literals corresponding to irredundant assignments to variables. This further
means Cross($S$) is a collection of prime implicants of $f$ [6,14], whose
disjunction has been proved to be a DNF equivalent to the CNF $f$. Therefore,
whenever a CNF is given, we can construct a set $S$ of sets, where every set in
$S$ collects literals in a disjunctive clause of the CNF. Then Cross($S$)
returns a set of sets representing an equivalent DNF. Conversely, when
interpreted as a DNF, Cross($S$) is equivalent to $S$ interpreted as a CNF.
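Definition 2 can be illustrated by enumerating minimal hitting sets directly. The sketch below is a brute-force reading of the definition, exponential in the number of literals and therefore only suitable for tiny examples; the encoding (literals as nonzero integers) and the function names are assumptions made here.

```python
from itertools import combinations, product

def cross(family):
    """Minimal sets of literals that intersect every set in `family` (Definition 2)."""
    literals = sorted(set().union(*family)) if family else []
    minimal = []
    for k in range(len(literals) + 1):            # candidates by increasing size
        for t in map(frozenset, combinations(literals, k)):
            hits_all = all(t & s for s in family)
            if hits_all and not any(h <= t for h in minimal):
                minimal.append(t)                 # no smaller hitting set is contained in t
    return set(minimal)

def cnf_holds(cnf, a):
    return all(any(a[abs(l)] == (l > 0) for l in c) for c in cnf)

def dnf_holds(dnf, a):
    return any(all(a[abs(l)] == (l > 0) for l in cube) for cube in dnf)

# Hypothetical CNF (a or b) and (not a or c), with a = 1, b = 2, c = 3.
S = {frozenset({1, 2}), frozenset({-1, 3})}
dnf = cross(S)
```

For this CNF, `dnf` contains the four cubes obtained by choosing one literal from each clause; the cube (a and not a) is contradictory and does not affect the equivalence with the CNF.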

By the analysis above, we can extend Definition 2 of the Cross operation to 
CNF formulas: 
Definition 3 (CNF Cross operation). Let $f$ be a CNF formula
$cl_1 \wedge \ldots \wedge cl_k$, where every
$cl_i = \bigvee_{\ell \in L_i} \ell$ is a clause formed by the disjunction of a
set of literals $L_i$. Let $S = \{L_1, \ldots, L_k\}$ be the representation of
$f$ as a family of sets. Then,


$\mathit{Cross}(f) = \bigvee_{L' \in \mathit{Cross}(S)} \bigwedge_{\ell \in L'} \ell$.


Note that Cross($f$) is a DNF formula. We can define in an analogous way the
Cross of a DNF formula as a CNF formula. We can verify that Cross($f$) and $f$
are equivalent:


Lemma 4. For a CNF formula $f$, $\mathit{Cross}(f) \equiv f$.


Proof. By the analysis above, the set Cross($S$) includes elements $L'$ which
are irredundant smallest sets that each have a common literal with every set of
literals in $S$. Therefore, every conjunction $\bigwedge_{\ell \in L'} \ell$, or
cube, has a common literal with every disjunctive clause in the CNF $f$, and
thus every cube has the same boolean values under the same set of truth
assignments as a prime implicant [6,21] of the CNF $f$. Then it follows that the
DNF Cross($f$), as a disjunction of these conjunctions, is logically equivalent
to the disjunction of all prime implicants of the CNF $f$, as proved by previous
works [21].


Note that the same result also holds for DNF formulas, following from the
fact that $\mathit{Cross}(f) \equiv f$ if and only if
$\neg\mathit{Cross}(f) \equiv \neg f$.

Now we aim to show how to construct witnesses one by one, why this
construction is correct, and why this construction is viable. First, if we fix
the witness $g_j = (f_j)_{y_j}^-$, and substitute positive and negative
occurrences with $g_j$ and $\mathit{Cross}(g_j)$ in the CNF formula $f_i$, then
the equivalence and CNF form of $f_i[y_j \mapsto g_j]$ can both be preserved. We
use the following lemma:


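Lemma 5. Let $f$ and $g$ be given as CNF formulas. Then $f[y \mapsto g]$ is
equivalent to
$(g \vee f_y^+) \wedge (\neg\mathit{Cross}(g) \vee f_y^-) \wedge f_y'$.


Proof. By Lemmas 1 and 4,
$f[y \mapsto g] \equiv ((y \vee f_y^+) \wedge (\neg y \vee f_y^-) \wedge f_y')[y \mapsto g] \equiv (g \vee f_y^+) \wedge (\neg g \vee f_y^-) \wedge f_y' \equiv (g \vee f_y^+) \wedge (\neg\mathit{Cross}(g) \vee f_y^-) \wedge f_y'$.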


Since $g = f_y^-$ is a CNF formula, $\mathit{Cross}(g)$ is a DNF formula, and
$\neg\mathit{Cross}(g)$ is a CNF. By distribution of $f_y^+$ over clauses in
$g$, and distribution of $f_y^-$ over clauses in $\neg\mathit{Cross}(g)$, the
resulting expression
$(g \vee f_y^+) \wedge (\neg\mathit{Cross}(g) \vee f_y^-) \wedge f_y'$ can be
converted to CNF form.

Alternatively, we can pick the witness $g = \neg f_y^+$, and instead substitute
$\mathit{Cross}(g)$ on positive occurrences and $g$ on negative occurrences of
$y$. Similarly, the formula
$(\mathit{Cross}(g) \vee f_y^+) \wedge (\neg g \vee f_y^-) \wedge f_y'$ can also
be converted to an equivalent CNF. Therefore, the equivalence and CNF form is
preserved for $f_i[y_j \mapsto g_j]$, leading to the following corollary.


Corollary 2. Every step in
$g_i(\vec{x}) = (f_i[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}])_{y_i}^-$
can be performed so it returns a CNF formula.

Proof. Corollary 2 follows from Lemma 3, Definition 3, and Lemma 5.


Finally, we have the witnesses constructed in this process:

Theorem 3. Let
$g_i(\vec{x}) = (f_i[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}])_{y_i}^-$
for $0 < i \le n$. Then, $g_i$ is a witness for $y_i$ in $f$, for every $y_i$.
The same applies if
$g_i(\vec{x}) = \neg(f_i[y_1 \mapsto g_1] \ldots [y_{i-1} \mapsto g_{i-1}])_{y_i}^+$.


Proof. Theorem 3 follows from Lemma 5 and Corollary 2.

4.3 Algorithm for Constructing Witnesses 


In the last subsection we described how to use Knuth’s Cross operation to
facilitate CNF/DNF conversion, enabling the use of iterated substitution. We
now describe our novel algorithm for synthesis using ZDDs.

We start by presenting the ZDD implementation of Cross function from 
Definition 2, following [14]: 


if ZDD Z is the 1-terminal then
    return 0-terminal;
else if ZDD Z is the 0-terminal then
    return 1-terminal;
else
    // Z_l denotes the ZDD rooted at the 0-child of the root of Z
    // Z_h denotes the ZDD rooted at the 1-child of the root of Z
    Z_U = Union(Z_l, Z_h);
    Z_ll = Cross(Z_U);
    Z_r = Cross(Z_l);
    Z_hh = Difference(Z_r, Z_ll);
    // Var(Z) denotes the variable at the root node of Z
    Z' = NewZDD(Var(Z), Z_ll, Z_hh);
    return Z';
end
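
As an illustration only, the recursion above can be mirrored on explicit families of sets instead of ZDD nodes. The sketch below is not the authors' implementation: the function name `cross`, the use of Python `frozenset`s for clauses, and the choice of the smallest element as the "root variable" are assumptions made here for readability.

```python
def cross(family):
    """Minimal hitting sets of a family of frozensets (explicit-set sketch).

    The empty family plays the role of the 0-terminal and the family
    containing only the empty set plays the role of the 1-terminal.
    """
    if not family:                      # 0-terminal -> 1-terminal
        return {frozenset()}
    if family == {frozenset()}:         # 1-terminal -> 0-terminal
        return set()
    # Assumption: the smallest element stands in for the root variable.
    v = min(x for s in family for x in s)
    low = {s for s in family if v not in s}          # 0-child
    high = {s - {v} for s in family if v in s}       # 1-child
    cross_union = cross(low | high)                  # hitting sets without v
    cross_low = cross(low)
    with_v = cross_low - cross_union                 # Difference step
    return cross_union | {s | {v} for s in with_v}   # NewZDD step
```

For the CNF (a ∨ b) ∧ (b ∨ c), encoded as {{1, 2}, {2, 3}}, this sketch yields the cubes {2} and {1, 3}, i.e., the DNF b ∨ (a ∧ c).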


We now explain how to perform the substitution following Lemma 5, where
we want to construct a ZDD of $f[y \mapsto g]$, where $f$ and $g$ are CNF formulas and
$y$ is a variable. Denote the ZDD representation of $f$ as $Z_f$ and that of $g$ as $Z_g$.
Then we compute the ZDD $\mathit{Cross}(Z_g)$ using the algorithm above. Recall that
this ZDD represents a DNF formula that is equivalent to $g$.

To construct a ZDD for the formula in Lemma 5, we need a ZDD for $\lnot\mathit{Cross}(g)$.
But note that the ZDD for the CNF $\lnot\mathit{Cross}(g)$ is equal to the ZDD for the DNF
$\mathit{Cross}(g)$, except that every positive literal $p$ is replaced with its negative literal $\lnot p$
and vice versa. Therefore, we want to swap $p$ and $\lnot p$ in $\mathit{Cross}(Z_g)$.

We retrieve the clauses with neither $p$ nor $\lnot p$ by

    $Z_1 = \mathit{Subset0}(\mathit{Subset0}(\mathit{Cross}(Z_g), p), \lnot p)$.

Then we swap $p$ with $\lnot p$ in every clause where $p$ appears positively:

    $Z_2 = \mathit{Change}(\mathit{Subset1}(\mathit{Cross}(Z_g), p), \lnot p)$.

And we swap $\lnot p$ with $p$ in every clause where $p$ appears negatively:

    $Z_3 = \mathit{Change}(\mathit{Subset1}(\mathit{Cross}(Z_g), \lnot p), p)$.

Finally, taking the union of $Z_1$, $Z_2$ and $Z_3$ gives us the ZDD $\lnot\mathit{Cross}(Z_g)$
encoding the CNF for the negation of $\mathit{Cross}(Z_g)$.
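
The three steps above can be sketched on explicit set families. This is a minimal illustration under stated assumptions, not the authors' code: literals are encoded as nonzero integers (`p` positive, `-p` negative), `subset1` is assumed to drop the selected literal from the returned sets (as in Minato's ZDD operations), and the helper names `swap_polarity` and `negate_dnf` are hypothetical.

```python
def subset0(family, lit):
    """Sets that do not contain `lit`."""
    return {s for s in family if lit not in s}

def subset1(family, lit):
    """Sets that contain `lit`, returned with `lit` removed (assumed semantics)."""
    return {s - {lit} for s in family if lit in s}

def change(family, lit):
    """Toggle membership of `lit` in every set of the family."""
    return {s ^ {lit} for s in family}

def swap_polarity(family, p):
    """Swap literal p with -p in every set (the Z1, Z2, Z3 construction)."""
    z1 = subset0(subset0(family, p), -p)   # sets mentioning neither p nor -p
    z2 = change(subset1(family, p), -p)    # p  becomes -p
    z3 = change(subset1(family, -p), p)    # -p becomes  p
    return z1 | z2 | z3

def negate_dnf(cubes):
    """Clause family of the CNF that negates a DNF, by flipping every literal."""
    result = set(cubes)
    for var in {abs(lit) for s in cubes for lit in s}:
        result = swap_polarity(result, var)
    return result
```

For example, the DNF (x1 ∧ ¬x2) ∨ x3, encoded as {{1, -2}, {3}}, is mapped to {{-1, 2}, {-3}}, which read as a CNF is (¬x1 ∨ x2) ∧ ¬x3. The sketch assumes no set contains both a literal and its negation.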

Let $Z_y^+$, $Z_y^-$ and $Z_y'$ be the ZDDs for $f_y^+$, $f_y^-$ and $f_y'$, respectively, constructed
as described in Section 3.1. We compute the ZDDs for $(g \lor f_y^+)$ and
$(\lnot\mathit{Cross}(g) \lor f_y^-)$ by ClauseDistribution($Z_g$, $Z_y^+$) and
ClauseDistribution($\lnot\mathit{Cross}(Z_g)$, $Z_y^-$), respectively. We then take the Union of these
two ZDDs and $Z_y'$ to get the ZDD for
$(g \lor f_y^+) \land (\lnot\mathit{Cross}(g) \lor f_y^-) \land f_y'$, which is exactly the ZDD for
$f[y \mapsto g]$ by Lemma 5.
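
To make the role of clause distribution concrete, here is a small sketch on explicit clause sets. It assumes that ClauseDistribution computes the CNF of a disjunction of two CNFs by pairing every clause of one with every clause of the other; the function names and the integer literal encoding are illustrative assumptions, not the paper's API.

```python
from itertools import product

def clause_distribution(cnf_a, cnf_b):
    """CNF of (A or B): the union of every clause of A with every clause of B."""
    return {a | b for a in cnf_a for b in cnf_b}

def holds(cnf, assignment):
    """Evaluate a CNF; literal l is true iff assignment[abs(l)] == (l > 0)."""
    return all(any(assignment[abs(l)] == (l > 0) for l in clause)
               for clause in cnf)

def is_disjunction(cnf_or, cnf_a, cnf_b, variables):
    """Brute-force check that cnf_or is equivalent to (cnf_a or cnf_b)."""
    for bits in product([False, True], repeat=len(variables)):
        asg = dict(zip(variables, bits))
        if holds(cnf_or, asg) != (holds(cnf_a, asg) or holds(cnf_b, asg)):
            return False
    return True
```

With A = x1 ∧ x2 and B = (¬x1 ∨ x3), `clause_distribution` returns the clauses (x1 ∨ ¬x1 ∨ x3) and (x2 ∨ ¬x1 ∨ x3), and the brute-force check confirms the equivalence with A ∨ B. The final Union step then corresponds to a plain set union of the resulting clause families.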


5 Experimental Evaluations 


5.1 Experimental Methodology and Setting


We perform a comparison between our ZDD-based synthesizer, ZSynth, and
the tool RSynth described in [9], using challenging $\Pi_2^P$ benchmarks from the
QBFEVAL 2016 data set [18], the latest QBFEVAL set that includes a 2QBF
(forall-exists) track, which is the format our benchmarks require. Each benchmark
was run with a time limit of 24 hours on Rice University’s NOTS cluster with
64 GB of RAM. We focus our comparison on the Fixpoint Detection, MutexP, and
QShifter benchmark families, omitting those families that are either too easy or
too hard to solve for both tools, namely, the Tree, Ranking Functions, Reduction
Finding, and Sorting Networks families [18]. Of these omitted benchmark families,
Tree is very simple and is solved very quickly by both tools, while the others could
be synthesized by neither tool. Therefore, we choose to focus on the three families
that provide an interesting comparison. Fixpoint Detection, MutexP and QShifter
have, respectively, 146, 7, and 6 instances.

For each tool we evaluate both total time and peak memory consumption 
for compilation, realizability, and synthesis, as well as the DD size for the orig- 
inal formula in each symbolic representation. We use the maximum cardinal- 
ity search (MCS) heuristic [23] to determine the ordering of variables in both 
ZDDs and BDDs. Due to restrictions on available time and space resources, some
benchmarks show out-of-time and out-of-memory failures. We measure the per- 
formance of both tools on the benchmarks that are solved. The experimental 
evaluations conclude that the ZDD-based approach is complementary to the 
BDD-based approach. 


5.2 Compilation Time and Size of Diagram Representing Original 
Formula 


We first compare the performance of CNF compilation into ZDDs and BDDs,
following the first research question proposed in Section 1. The log-scale bar plot in
Fig. 1 presents compilation time for the benchmarks from the selected families,
per Section 5.1. The size of the bar representing each formula is proportional
to the compilation time.

The compilation into a ZDD takes polynomial (at most quadratic) time, 
because paths in the ZDD correspond to clauses, and therefore the size of the 
ZDD is always linear in the size of the formula. In contrast, the compilation into 
a BDD can be exponential, because paths in a BDD correspond to assignments, 
and therefore the number of paths can be exponential. The advantage of ZDDs as 
a compact representation is consistent with our conjecture. Across all benchmark 
families in QBFEVAL’16, compilation into ZDDs takes less time and space than 
BDDs in most cases. 

It is worth noting that we construct here the ZDD representation of the CNF 
formulas by adding one clause at a time using the Union operator. Compilation 
could be further optimized by using a divide-and-conquer approach, where we 
split the set of clauses in half, construct ZDDs for each half recursively, and then 
take their union.

Fig. 1: Compilation time of the CNF: red = BDD, blue = ZDD


5.3 Realizability Time 


The plot in Fig. 2 summarizes for each family the time spent on constructing
the realizability set and checking partial and full realizability. The dashed lines
in red illustrate RSynth results, while the solid lines in dark blue with the same
shapes show ZSynth results. As each solver has families in which it has an
advantage, we note how many instances of each family each solver is able to
solve within a given time. We include data for all benchmarks that completed the
realizability phase. The graph plots the percentage of benchmarks in each family
that RSynth and ZSynth complete for a given timeout, with 100% meaning that
all instances of that family were solved.

Fig. 2: Percentage solved for realizability within a given timeout. Dashed red =
BDD, solid blue = ZDD.

We see from Fig. 2 that RSynth solves more cases of the Fixpoint Detection
family, and does so faster than ZSynth. Most of the cases it solves are completed
in under 10 ms. On the other hand, ZSynth has the advantage in the QShifter
and MutexP families, for which it is able to solve more cases in a shorter time.
Therefore, ZSynth and RSynth each perform better on different families of
benchmarks. This allows us to answer the second research question proposed
in Section 1 with the observation that neither approach dominates across the
board; rather, realizability performance depends on the benchmark family.
As we see below in Section 5.4, these general results also extend to end-to-end
synthesis.


5.4 End-to-End Time and Peak Memory 


Our observations for end-to-end synthesis time, including compilation,
realizability, and witness construction, are plotted in Fig. 3. Similarly to realizability
time, the total end-to-end synthesis time shows strongly family-dependent
results. Both ZSynth and RSynth display better relative performance on the same
families as they did for realizability. In families where ZSynth solves more
instances, including QShifter and MutexP, ZSynth also takes less time in most
cases, and vice versa for those families where RSynth solves more benchmarks
end-to-end.

Fig. 3: Percentage solved end-to-end within a given timeout. Dashed red = BDD,
solid blue = ZDD.

We observed in our experiments that memory and time were generally cor- 
related, meaning that benchmarks that took more time also consumed more 
memory. This is expected when dealing with algorithms based on decision dia- 
grams, since the biggest factor that impacts the performance of such algorithms 
is diagram size. In practice, the memory comparison between RSynth and ZSynth
in compilation, realizability and witness construction shows patterns similar to
those of the time comparison. Even if ZDDs have an advantage in representing the initial
specification, the overall memory consumption for realizability and synthesis is, 
similarly to running time, largely dependent on the benchmark family. 


5.5 Scalable Benchmarks Show ZDD has Slower Growing Demands 
of Time and Space 


To analyze the scalability of ZDDs in relation to BDDs, as per the fifth research 
question in Section 1, we take a closer look at the running time and node counts of 
ZSynth and RSynth in the benchmarks of the QShifter family. All benchmarks in 
this family follow the same structure, just scaled based on a numerical parameter. 
For a parameter n, qshifter_n has $2^{2n+1}$ clauses, $2^n + n$ input variables and $2^n$
output variables, so we expect to see exponential trends in the measured values. 

The results can be found in Fig. 4, which considers only QShifter because it
can be scaled based on a parameter, and RSynth did not solve enough instances
of MutexP to have an interesting scalability comparison. Since RSynth solves
only the smallest instances in the QShifter family, we use the maximal
time limit, illustrated by the “X” in the plot not connected to any line, as a
conservative underestimate for the running time of further instances. (Therefore,
the compilation, realizability and end-to-end times for RSynth in qshifter_5
must be higher than the “X” mark.) As QShifter benchmarks are regular in their
construction, we can observe the trend of the exponent.

The results for RSynth, both for time and number of nodes, always have a
steeper slope in the parameter n. Since the graph is in log scale, straight lines
represent an exponential increase, and the slope represents the coefficient of the 
exponent. Therefore, although both ZSynth and RSynth grow exponentially, in 
both time and space, ZSynth is more efficient by an exponential factor. 
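
To spell out what "an exponential factor" means here, suppose (as an idealization introduced only for this illustration) that on the log-scale plot the two tools follow straight lines with slopes $a_R > a_Z$ and intercepts $b_R$, $b_Z$. These symbols are assumptions and do not appear in the measurements.

```latex
\begin{align*}
\log T_R(n) &= a_R\, n + b_R, & \log T_Z(n) &= a_Z\, n + b_Z,\\
\frac{T_R(n)}{T_Z(n)} &= e^{\,b_R - b_Z}\; e^{(a_R - a_Z)\, n}.
\end{align*}
```

Under this idealization the ratio of the two running times (or node counts) itself grows exponentially in n whenever $a_R > a_Z$, which is the sense in which the gap widens with the parameter.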

These results suggest that there are families for which we can expect ZDD 
synthesizers to require significantly fewer resources in time and space as the size 
of the formulas grows. The QShifter family is one example of a family where the 
ZDD algorithm performs better by an exponential factor.

Fig. 4: Scalable family evaluations: dashed red = BDD, solid blue = ZDD.


5.6 Overall Comparison 


As explained in Section 5.1, we focus on evaluating the synthesizers on the
Fixpoint Detection, QShifter, and MutexP families of benchmarks [18]. ZSynth
shows clear time and space advantages on the MutexP and QShifter families,
while RSynth performs better in the Fixpoint Detection family. In Table 1, we
show how much of each family either tool was able to solve.

Table 1: Percentage of end-to-end completed instances in each family.

Benchmark Family Name | RSynth (BDD) | ZSynth (ZDD)
Fixpoint Detection    | 30.82%       | 20.55%
MutexP                | 14.29%       | 42.86%
QShifter (scalable)   | 28.57%       | 100%

Next, we summarize the overall results of our experimental performance com- 
parison. In families where ZDD completed more instances end-to-end, we can see 
that ZDD has better performance in all bases of comparison, including compi- 
lation, realizability, and end-to-end time, as well as diagram node count for the 
original formula and peak node count. Additionally, Section 5.5 shows that there 
exist families of scalable benchmarks for which the time and space demands of 
ZDDs grow more slowly than BDDs by an exponential factor, as illustrated by 
the smaller slope in Fig. 4. 

Even in the Fixpoint Detection family, where BDDs solve more instances, 
ZDDs show advantages in compilation time, initial diagram size, and smaller 
scaling slopes in time and space. In realizability and overall synthesis perfor- 
mance, neither our ZDD-based algorithm nor the BDD-based algorithm dom- 
inates across the board, each performing better in those families where it can 
solve more instances. 


6 Conclusion 


We conclude that ZDD-based algorithms are competitive with those based on
BDDs, and both have their place in a portfolio of solvers for boolean synthesis.
Since both BDDs and ZDDs can be converted to circuits, we advocate
that an industrial solver would benefit from both approaches. In CNF-specified
boolean-synthesis problems, BDDs and ZDDs are orthogonal approaches, and
circumstances exist where each one of the solvers shows leading performance. For
this type of problem, we advocate a multi-engine portfolio approach that
includes both.

As most tools for QBF solving and synthesis solving handle the input formula 
monolithically, future research based on this work includes an exploration of 
partitioning of variables [8] and factored synthesis [23] in the context of ZDDs. 
Another direction is to explore the usage of ZDD-based techniques in the context 
of temporal synthesis, cf. [24].

Abstract. Digital mathematical libraries assemble the knowledge of years
of mathematical research. Numerous disciplines (e.g., physics, engineering,
pure and applied mathematics) rely heavily on compendia of gathered findings.
Likewise, modern research applications rely more and more on computational
solutions, which are often calculated and verified by computer algebra systems.
Hence, the correctness, accuracy, and reliability of both digital mathematical
libraries and computer algebra systems are crucial attributes for modern
research. In this paper, we present a novel approach to verify a digital
mathematical library and two computer algebra systems with one another by
converting mathematical expressions from one system to the other. We use
our previously developed conversion tool (referred to as LaCASt) to translate
formulae from the NIST Digital Library of Mathematical Functions to the
computer algebra systems Maple and Mathematica. The contributions of
our presented work are as follows: (1) we present the most comprehensive 
verification of computer algebra systems and digital mathematical libraries 
with one another; (2) we significantly enhance the performance of the un- 
derlying translator in terms of coverage and accuracy; and (3) we provide 
open access to translations for Maple and Mathematica of the formulae in 
the NIST Digital Library of Mathematical Functions. 


Keywords: Presentation to Computation, LaCASt, LaTeX, Semantic La- 
TeX, Computer Algebra Systems, Digital Mathematical Library 


1 Introduction 


Digital Mathematical Libraries (DML) gather the knowledge and results from
thousands of years of mathematical research. Even though pure and applied mathematics
are precise disciplines, gathering their knowledge bases over many years results in
issues which every digital library shares: consistency, completeness, and accuracy.
Likewise, Computer Algebra Systems (CAS)⁸ play a crucial role in the modern era
for pure and applied mathematics, and those fields which rely on them. CAS can
be used to simplify, manipulate, compute, and visualize mathematical expressions.
Accordingly, modern research regularly uses DML and CAS together. Nonetheless,
DML [7,14] and CAS [1,20,11] are not exempt from having bugs or errors. Durán
et al. [11] even raised the rather dramatic question: “can we trust in [CAS]?”

Existing comprehensive DML, such as the Digital Library of Mathematical
Functions (DLMF) [10], are consistently updated and frequently corrected with errata⁹.
Although each chapter of the DLMF has been carefully written, edited, validated, 
and proofread over many years, errors still remain. Maintaining a DML, such as the 
DLMF, is a laborious process. Likewise, CAS are eminently complex systems, and in 
the case of commercial products, often similar to black boxes in which the magic (i.e., 
the computations) happens in opaque private code [11]. CAS, especially commercial 
products, are often exclusively tested internally during development. 

An independent examination process can improve testing and increase trust in the 
systems and libraries. Hence, we want to elaborate on the following research question. 


How can digital mathematical libraries and computer algebra systems be utilized 
to improve and verify one another? 


Our initial approach for answering this question is inspired by our previous studies
on translating DLMF equations to CAS [7]. In order to verify a translation tool from
a specific LaTeX dialect to Maple¹⁰, we performed symbolic and numeric evaluations
on equations from the DLMF. Our approach presumes that a proven equation in a
DML must also be valid in a CAS. In turn, a disparity between the DML and
CAS would indicate an issue in the translation process. However, assuming a correct
translation, a disparity would also indicate an issue either in the DML source or the
CAS implementation. In turn, we can take advantage of the same approach to improve
and even verify DML with CAS and vice versa. Unfortunately, previous efforts to
translate mathematical expressions from various formats, such as LaTeX [8,14,29],
MathML [31], or OpenMath [18,30], to CAS syntax have shown that the translation
will be the most critical part of this verification approach.
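
The idea of a numeric evaluation can be illustrated with a toy check that needs no CAS. The sketch below is not the evaluation pipeline of this paper (which is described in Section 4 and runs inside Maple and Mathematica); the function name `agrees_numerically`, the test points, and the tolerance are assumptions chosen for the example. It compares both sides of an equation at a few points and reports a disparity when they differ.

```python
import cmath

def agrees_numerically(lhs, rhs, test_points, tol=1e-9):
    """True if |lhs(z) - rhs(z)| <= tol at every test point."""
    return all(abs(lhs(z) - rhs(z)) <= tol for z in test_points)

# Hypothetical test points: two real values and one complex value.
points = [0.5, -1.25, 2 + 1j]

# A valid identity: sin(2z) = 2 sin(z) cos(z).
double_angle_ok = agrees_numerically(
    lambda z: cmath.sin(2 * z),
    lambda z: 2 * cmath.sin(z) * cmath.cos(z),
    points,
)

# The same identity with a deliberately injected sign error.
sign_error_detected = not agrees_numerically(
    lambda z: cmath.sin(2 * z),
    lambda z: -2 * cmath.sin(z) * cmath.cos(z),
    points,
)
```

A failed check of this kind does not say where the problem lies: as discussed above, the cause may be the translation, the library entry, or the CAS.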

In this paper, we elaborate on the feasibility and limitations of the translation 
approach from DML to CAS as a possible answer to our research question. We 
further focus on the DLMF as our DML and the two general-purpose CAS Maple 
and Mathematica for this first study. This relatively sharp limitation is necessary in 
order to analyze the capabilities of the underlying approach to verify commercial CAS
and large DML. The DLMF uses semantic macros internally in order to disambiguate
mathematical expressions [27,35]. These macros help to mitigate the open issue
of retrieving sufficient semantic information from a context to perform translations
to formal languages [31,14]. Further, the DLMF and general-purpose CAS have a
relatively large overlap in coverage of special functions and orthogonal polynomials.
Since many of those functions play a crucial role in a large variety of different research
fields, we focus in this study mainly on these functions. Lastly, we will take our
previously developed translation tool LaCASt [8,14] as the baseline for translations
from the DLMF to Maple. In this successor project, we focus on improving LaCASt
to minimize the negative effect of wrong translations as much as possible for our
study. In the future, other DML and CAS can be improved and verified following
the same approach by using a different translation approach depending on the data
of the DML, e.g., MathML [31] or OpenMath [18].

8 In the sequel, the acronyms CAS and DML are used, depending on the context,
interchangeably with their plurals.
9 https://dlmf.nist.gov/errata/ [accessed 09/01/2021]
10 The mention of specific products, trademarks, or brand names is for purposes of
identification only. Such mention is not to be interpreted in any way as an endorsement
or certification of such products or brands by the National Institute of Standards and
Technology, nor does it imply that the products so identified are necessarily the best
available for the purpose. All trademarks mentioned herein belong to their respective
owners.

In particular, in this paper, we fix the majority of the remaining issues of LaCASt [7],
which allows our tool to translate twice as many expressions from the DLMF to the
CAS as before. Current extensions include the support for the mathematical
operators sum, product, limit, and integral, as well as overcoming semantic hurdles
associated with Lagrange (prime) notations commonly used for differentiation. Further, we
extend its support to include Mathematica using the freely available Wolfram Engine
for Developers (WED)¹¹ (hereafter, with Mathematica, we refer to the WED). These
improvements allow us to cover a larger portion of the DLMF, increase the reliability
of the translations via LaCASt, and allow for comparisons between two major
general-purpose CAS for the first time, namely Maple and Mathematica. Finally, we provide
open access to all the results contained within this paper, including all translations
of DLMF formulae, an endpoint to LaCASt¹², and the full source code of LaCASt¹³.

The paper is structured as follows. Section 2 explains the data in the DLMF. 
Section 3 focuses on the improvements of LaCASt that have been made to make the
translation as comprehensive and reliable as possible for the upcoming evaluation. Section 4
explains the symbolic and numeric evaluation pipeline. Since Cohl et al. [7] only briefly 
sketched the approach of a numeric evaluation, we will provide an in-depth discussion 
of that process in Section 4. Subsequently, we analyze the results in Section 5. Finally, 
we conclude the findings and provide an outlook for upcoming projects in Section 6. 


1.1 Related Work 


Existing verification techniques for CAS often focus on specific subroutines or
functions [26,20,5,12,6,25,21,17], such as specific theorems [23], differential equations [19],
or the implementation of the math.h library [24]. Most common are verification
approaches that rely on intermediate verification languages [5,20,21,19,17], such as
Boogie [25,2] or Why3 [21,4], which, in turn, rely on proof assistants and theorem
provers, such as Coq [5,3], Isabelle [19,28], or HOL Light [20,16,17]. Kaliszyk and
Wiedijk [20] proposed an entirely new CAS which is built on top of the proof
assistant HOL Light so that each simplification step can be proven by the underlying
architecture. Lewis and Wester [26] manually compared the symbolic computations
on polynomials and matrices with seven CAS. Aguirregabiria et al. [1] suggested
teaching students the known traps and difficulties with evaluations in CAS instead, to
reduce the overreliance on computational solutions.

11 https://www.wolfram.com/engine/ [accessed 09/01/2021]
12 https://lacast.wmflabs.org/ [accessed 01/01/2022]
13 https://github.com/ag-gipp/LaCASt [accessed 04/01/2022]

Cohl et al. [7] developed the aforementioned translation tool LaCASt, which
translates expressions from a semantically enhanced LaTeX dialect to Maple. By evaluating
the performance and accuracy of the translations, we were able to discover a sign error
in one of the DLMF’s equations [7]. While the evaluation was not intended to verify
the DLMF, the translations by the rule-based translator LaCASt provided sufficient
robustness to identify issues in the underlying library. To the best of our knowledge,
besides this related evaluation via LaCASt, there are no existing libraries or tools that
allow for automatic verification of DML.


2 The DLMF dataset 


In the modern era, most mathematical texts (handbooks, journal publications,
magazines, monographs, treatises, proceedings, etc.) are written using the document
preparation system LaTeX. However, the focus of LaTeX is on precise control of the
rendering mechanics rather than on a semantic description of its content. In contrast,
CAS syntax is coercively unambiguous in order to interpret the input correctly. Hence,
a transformation tool from DML to CAS must disambiguate mathematical
expressions. While there is an ongoing effort towards such a process [32,22,34,13,36,33],
there is no reliable tool available to disambiguate mathematics sufficiently to date.

The DLMF contains numerous relations between functions and many other
properties. It is written in LaTeX but uses specific semantic macros when applicable [35].
These semantic macros represent a unique function or polynomial defined in the DLMF.
Hence, the semantic LaTeX used in the DLMF is often unambiguous. For a successful
evaluation via CAS, we also need to utilize all requirements of an equation, such as
constraints, domains, or substitutions. The DLMF provides this additional data too,
generally in a machine-readable form [35]. This data is accessible via the i-boxes
(information boxes next to an equation marked with the icon ⓘ). If the information
is not given in the attached i-box or the information is incorrect, the translation via
LaCASt would fail. The i-boxes, however, do not contain information about branch cuts
(see Section B) or constraints. Constraints are accessible if they are directly attached
to an equation. If they appear in the text (or even a title), LaCASt cannot utilize them.
The test dataset we use was generated from DLMF Version 1.1.3 (2021-09-15)
and contained 9,977 formulae with 1,505 defined symbols, 50,590 used symbols, 2,691
constraints, and 2,443 warnings for non-semantic expressions, i.e., expressions without
semantic macros [35]. Note that the DLMF does not provide access to the underlying
LaTeX source. Therefore, we added the source of every equation to our result dataset.


3 Semantic LaTeX to CAS translation


The aforementioned translator LaCASt was developed by Cohl and Greiner-Petter et
al. [8,7,14]. They reported a coverage of 58.8% translations for a manually selected
part of the DLMF to the CAS Maple. This version of LaCASt serves as a baseline
for our improvements. In order to verify their translations, they used symbolic and
numeric evaluations and reported a success rate of ~16% for symbolic and ~12%
for numeric verifications.

Evaluating the baseline on the entire DLMF results in a coverage of only 31.6%.
Hence, we first want to increase the coverage of LaCASt on the DLMF. To achieve this
goal, we first increased the number of translatable semantic macros by manually
defining more translation patterns for special functions and orthogonal polynomials. For
Maple, we increased the number from 201 to 261. For Mathematica, we defined 279 new
translation patterns, which enable LaCASt to perform translations to Mathematica.
Even though the DLMF uses 675 distinct semantic macros, we cover ~70% of
all DLMF equations with our extended list of translation patterns (see Zipf’s law for
mathematical notations [15]). In addition, we implemented rules for translations that
are applicable in the context of the DLMF, e.g., ignore ellipsis following floating-point
values or \choose always refers to a binomial expression. Finally, we tackle the
remaining issues outlined by Cohl et al. [7], which can be categorized into three groups: (i)
expressions in which the arguments of operators are not clear, namely sums, products,
integrals, and limits; (ii) expressions with prime symbols indicating differentiation; and
(iii) expressions that contain ellipsis. While we solve some of the cases in Group (iii) by
ignoring ellipsis following floating-point values, most of these cases remain unresolved.
In the following, we elaborate on our solutions for (i) in Section 3.1 and (ii) in Section 3.2.


3.1 Parse sums, products, integrals, and limits 


Here we consider common notations for the sum, product, integral, and limit operators. 
For these operators, one may consider mathematically essential operator metadata 
(MEOM). For all these operators, the MEOM includes argument(s) and bound vari- 
able(s). The operators act on the arguments, which are themselves functions of the 
bound variable(s). For sums and products, the bound variables are referred to as 
indices. The bound variables for integrals¹⁴ are called integration variables. For limits,
the bound variables are continuous variables (for limits of continuous functions) and 
indices (for limits of sequences). For integrals, MEOM include precise descriptions of 
regions of integration (e.g., piecewise continuous paths/intervals/regions). For limits, 
MEOM include limit points (e.g., points in $\mathbb{R}^n$ or $\mathbb{C}^n$ for $n \in \mathbb{N}$), as well as information
related to whether the limit to the limit point is independent or dependent on the 
direction in which the limit is taken (e.g., one-sided limits). 

For a translation of mathematical expressions involving the LaTeX commands
\sum, \int, \prod, and \lim, we must extract the MEOM. This is achieved by (a)
determining the argument of the operator and (b) parsing corresponding subscripts,
superscripts, and arguments. For integrals, the MEOM may be complicated, but
certainly contains the argument (function which will be integrated), bound (integration)
variable(s) and details related to the region of integration. Bound variable extraction
is usually straightforward since it is usually contained within a differential expression
(infinitesimal, pushforward, differential 1-form, exterior derivative, measure, etc.),
e.g., $dx$. Argument extraction is less straightforward since, even though differential
expressions are often given at the end of the argument, sometimes the differential
expression appears in the numerator of a fraction (e.g., $\int \frac{f(x)\,dx}{g(x)}$). In that case, the
argument is everything to the right of the \int (neglecting its subscripts and
superscripts) up to and including the fraction involving the differential expression (which
may be replaced with 1). In cases where the differential expression is fully to the right
of the argument, it is a termination symbol. Note that some scientists use an
alternate notation for integrals where the differential expression appears immediately
to the right of the integral, e.g., $\int dx\, f(x)$. However, this notation does not appear
in the DLMF. If such notations are encountered, we follow the same approach that
we used for sums, products, and limits (see Section 3.1).

14 The notion of integrals includes: antiderivatives (indefinite integrals), definite integrals,
contour integrals, multiple (surface, volume, etc.) integrals, Riemannian volume integrals,
Riemann integrals, Lebesgue integrals, Cauchy principal value integrals, etc.


Extraction of variables and corresponding MEOM The subscripts and super- 
scripts of sums, products, limits, and integrals may be different for different notations 
and are therefore challenging to parse. For integrals, we extract the bound (integra- 
tion) variable from the differential expression. For sums and products, the upper 
and lower bounds may appear in the subscript or superscript. Parsing subscripts is 
comparable with the problem of parsing constraints [7] (which are often not consis- 
tently formulated). We overcame this complexity by manually defining patterns of 
common constraints and refer to them as blueprints. This blueprint pattern approach 
allows LaCASt to identify the MEOM in the sub- and superscripts. A more detailed
explanation with examples of the blueprints is available in Appendix A¹⁵.


Identification of operator arguments Once we have extracted the bound variable 
for sums, products, and limits, we need to determine the end of the argument. We 
analyzed all sums in the DLMF and developed a heuristic that covers all the formulae 
in the DLMF and potentially a large portion of general mathematics. Let x be the 
extracted bound variable. For sums, we consider a summand as a part of the argument 
if (I) it is the very first summand after the operation; or (II) x is an element of the 
current summand; or (III) x is an element of the following summand (subsequent 
to the current summand) and there is no termination symbol between the current 
summand and the summand which contains x with an equal or lower depth according 
to the parse tree (i.e., closer to the root). We consider a summand as a single logical 
construct since addition and subtraction are granted a lower operator precedence than 
multiplication in mathematical expressions. Similarly, parentheses are granted higher 
precedence and, thus, a sequence wrapped in parentheses is part of the argument if 
it obeys the rules (I-III). Summands, and such sequences, are always entirely part 
of sums, products, and limits or entirely not. 

A termination symbol always marks the end of the argument list. Termination
symbols are relation symbols, e.g., =, ≠, <, closing parentheses or brackets, e.g.,
), ], or >, and other operators with MEOMs if and only if they define the same
bound variable. If x is part of a subsequent operation, then the following operator
is considered as part of the argument (as in (II)). However, a termination symbol
only terminates the current chain of arguments. Consider a sum over a fraction of
sums. In that case, we may reach a termination symbol within the fraction. However,
the termination symbol would be deeper inside the parse tree as compared to the
current list of arguments. Hence, we use the depth to determine if a termination
symbol should be recognized or not.

15 The Appendix is available at https://arxiv.org/abs/2201.09488.

Consider an unusual notation with the binomial coefficient as an example:

$$\sum_{k=0}^{n} \binom{n}{k} = \sum_{k=0}^{n} \frac{\prod_{m=1}^{n} m}{\prod_{m=1}^{k} m \, \prod_{m=1}^{n-k} m}$$

This equation contains two termination symbols, marked red and green in the original
typesetting. The red termination symbol = obviously terminates the first sum on the
left-hand side of the equation. The green termination symbol $\prod$ terminates the
product to the left because both products run over the same bound variable m. In
addition, none of the other = signs are termination symbols for the sum on the
right-hand side of the equation because they are deeper in the parse tree and thus do
not terminate the sum.

Note that varN in the blueprints also matches multiple bound variables, e.g.,
$\sum_{m,k \in A}$. In such cases, x from above is a list of bound variables and a
summand is part of the argument if one of the elements of x is within this summand.
Due to the translation, the operation will be split into two consecutive operations,
i.e., $\sum_{m,k \in A}$ becomes $\sum_{m \in A} \sum_{k \in A}$. Figure 1 shows the
extracted arguments for some example sums. The same rules apply for extraction of
arguments for products and limits.

[Figure 1 omitted.]
Fig. 1: Example argument identifications for sums.
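
The rules (I-III) can be illustrated with a small sketch. It is not the LaCASt implementation: it assumes the expression has already been flattened into a list of summands and termination symbols at a single parse-tree depth, and the names `Summand`, `Terminator`, and `argument_summands` are hypothetical. Rule (III) is read here as "some later summand before the next termination symbol contains a bound variable".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summand:
    text: str
    identifiers: frozenset  # identifiers that occur in this summand


@dataclass(frozen=True)
class Terminator:
    symbol: str  # e.g. "=", ")", or an operator reusing the bound variable


def argument_summands(items, bound_vars):
    """Return the summands that belong to the operator's argument.

    `items` are the summands and termination symbols that follow the
    operator at the same parse-tree depth (deeper symbols are assumed to
    be hidden inside the summands already).
    """
    bound = set(bound_vars)
    # Nothing behind the first termination symbol can be part of the argument.
    reachable = []
    for item in items:
        if isinstance(item, Terminator):
            break
        reachable.append(item)
    if not reachable:
        return []
    # Rule (I): the very first summand always belongs to the argument.
    last = 0
    # Rules (II) and (III): extend the argument up to the last summand
    # that contains one of the bound variables.
    for index, summand in enumerate(reachable):
        if bound & summand.identifiers:
            last = index
    return reachable[: last + 1]
```

For the items `k^2`, `n*k`, `c`, `=`, `k` with bound variable `k`, this sketch keeps `k^2` and `n*k` and stops before `c`, because no later summand before the `=` contains `k`.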


3.2 Lagrange’s notation for differentiation and derivatives 


Another remaining issue is the Lagrange (prime) notation for differentiation, since it
does not outwardly provide sufficient semantic information. This notation presents
two challenges. First, we do not know with respect to which variable the differentiation
should be performed. Consider for example the Hurwitz zeta function ζ(s,a) [10,
§25.11]. In the case of a differentiation ζ′(s,a), it is not clear if the function should be
differentiated with respect to s or a. To remedy this issue, we analyzed all formulae
in the DLMF which use prime notations and determined which variables (slots) for
which functions represent the variables of the differentiation. Based on our analysis, we
extended the translation patterns by meta information for semantic macros according
to the slot of differentiation. For instance, in the case of the Hurwitz zeta function,
the first slot is the slot for prime differentiation, i.e., ζ′(s,a) = d/ds ζ(s,a). The identified
variables of differentiations for the special functions in the DLMF can be considered
to be the standard slots of differentiations, e.g., in other DML, ζ′(s,a) most likely
refers to d/ds ζ(s,a).

The second challenge occurs if the slot of differentiation contains complex
expressions rather than single symbols, e.g., ζ′(s²,a). In this case, ζ′(s²,a) =
d/d(s²) ζ(s²,a) instead of d/ds ζ(s²,a). Since CAS often do not support derivatives with
respect to complex expressions, we use the inbuilt substitution functions¹⁶ in the CAS
to overcome this issue. To do so, we use a temporary variable temp for the substitution.
CAS perform substitutions from the inside to the outside. Hence, we can use the
same temporary variable temp even for nested substitutions. Table 1 shows the
translation performed for ζ′(s²,a). CAS may provide optional arguments to calculate
the derivatives for certain special functions, e.g., Zeta(n,z,a) in Maple for the n-th
derivative of the Hurwitz zeta function. However, this shorthand notation is generally
not supported (e.g., Mathematica does not define such an optional parameter). Our
substitution approach is lengthier but also more reliable. Unfortunately, lengthy
expressions generally harm the performance of CAS, especially for symbolic
manipulations. Hence, we have a genuine interest in keeping translations short,
straightforward, and readable. Thus, the substitution translation pattern is only
triggered if the variable of differentiation is not a single identifier. Note that this
substitution only triggers on semantic macros. Generic functions, including prime
notations, are still skipped.
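
As a rough illustration of the substitution pattern, the following sketch builds a Mathematica derivative string for a given slot of differentiation. It is an assumption-laden toy, not the LaCASt translator: the function name `mathematica_prime`, its parameters, and the output format for the single-identifier case are hypothetical; only the substituted form follows the Mathematica translation quoted for ζ′(s²,a).

```python
import re

# A "single identifier" is assumed to be a letter followed by letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def mathematica_prime(function, args, slot, order=1, temp="temp"):
    """Differentiate `function` with respect to the argument in `slot`.

    `args` are already-translated Mathematica argument strings.
    """
    target = args[slot]
    if IDENTIFIER.fullmatch(target):
        # Single identifier: differentiate directly, no substitution needed.
        inner = f"{function}[{', '.join(args)}]"
        return f"D[{inner}, {{{target}, {order}}}]"
    # Complex expression: differentiate with respect to a temporary variable
    # and substitute the original expression back in afterwards.
    replaced = list(args)
    replaced[slot] = temp
    inner = f"{function}[{', '.join(replaced)}]"
    return f"D[{inner}, {{{temp}, {order}}}]/.{temp}->{target}"


print(mathematica_prime("HurwitzZeta", ["s", "a"], slot=0))
print(mathematica_prime("HurwitzZeta", ["(s)^(2)", "a"], slot=0))
```

Keeping the direct form for single identifiers reflects the stated goal of short, readable translations; the substitution is only emitted when it is actually needed.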


Table 1: Example translations for the prime derivative of the Hurwitz zeta function
with respect to s².

System      | ζ′(s²,a)
DLMF        | \Hurwitzzeta'@{s^2}{a}
Maple       | subs(temp=(s)^(2), diff(…
Mathematica | D[HurwitzZeta[temp,a], {temp, 1}]/.temp->(s)^(2)

A related problem to MEOM of sums, products, integrals, limits, and
differentiations are the notations of derivatives. The semantic macro for
derivatives \deriv{w}{x} (rendered as dw/dx) is often used with an empty
first argument to render the function behind the derivative notation, e.g.,
\deriv{}{x}\sin@{x} for d/dx sin x. This leads to the same problem we
faced above for identifying MEOMs. In this case, we use the same heuristic
as we did for sums, products, and limits. Note that derivatives may be written
following the function argument, e.g., sin(x) d/dx. If we are unable to identify any
following summand that contains the variable of differentiation before we reach a
termination symbol, we look for arguments prior to the derivative according to the
heuristic (I-III).


Wronskians With the support of prime differentiation described above, we are
also able to translate the Wronskian [10, (1.13.4)] to Maple and Mathematica. A
translation requires one to identify the variable of differentiation from the elements
of the Wronskian, e.g., z for W{Ai(z),Bi(z)} from [10, (9.2.7)]. We analyzed all
Wronskians in the DLMF and discovered that most Wronskians have a special
function in their arguments, such as the example above. Hence, we can use our
previously inserted metadata information about the slots of differentiation to extract
the variable of differentiation from the semantic macros. If the semantic macro
argument is a complex expression, we search for the identifier in the arguments that
appears in both elements of the Wronskian. For example, in W{Ai(z^a), ζ(z²,a)}, we
extract z as the variable since it is the only identifier that appears in both arguments
z^a and z² of the elements. This approach is also used when there is no semantic macro
involved, i.e., from W{z, z²} we extract z as well. If LaCASt extracts multiple
candidates or none, it throws a translation exception.

16 Note that Maple also supports an evaluation substitution via the two-argument eval
function. Since our substitution only triggers on semantic macros, we only use subs if the
function is defined in Maple. In turn, as far as we know, there is no practical difference
between subs and the two-argument eval in our case.
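
The candidate search for the Wronskian can be sketched as a set intersection. This is a simplification under stated assumptions: the arguments in the slots of differentiation are given as plain strings, every run of letters counts as one identifier, and the function name `wronskian_variable` is hypothetical.

```python
import re

# Assumption: every maximal run of letters is one identifier.
IDENTIFIER = re.compile(r"[A-Za-z]+")


def wronskian_variable(first_argument, second_argument):
    """Return the single identifier shared by both Wronskian elements."""
    common = set(IDENTIFIER.findall(first_argument)) & set(
        IDENTIFIER.findall(second_argument)
    )
    if len(common) != 1:
        # Mirrors the described behaviour: no candidate or several candidates
        # means the variable of differentiation cannot be determined.
        raise ValueError(f"cannot determine variable of differentiation: {sorted(common)}")
    return common.pop()


print(wronskian_variable("z^a", "z^2"))
```

A real implementation would work on the parse tree rather than on strings, so that function names inside the arguments are not mistaken for identifiers.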


4 Evaluation of the DLMF using CAS 


[Figure 2 omitted: workflow diagram of the evaluation engine. Recoverable labels:
Digital Library of Mathematical Functions; Case Analyzer (constraints, constraint
blueprints, substitutions); Case Filter; LaCASt Translator; Symbolic Evaluator;
Numeric Test Value Filter; Numeric Evaluator; Maple and Mathematica.]


Fig. 2: The workflow of the evaluation engine and the overall results. Errors and 
abortions are not included. The generated dataset contains 9,977 equations. In total, 
the case analyzer splits the data into 10,930 cases of which 4,307 cases were filtered. 
This sums up to a set of 6,623 test cases in total. 


For evaluating the DLMF with Maple and Mathematica, we follow the same
approach as demonstrated in [7], i.e., we symbolically and numerically verify the
equations in the DLMF with CAS. If a verification fails both symbolically and
numerically, we have identified an issue either in the DLMF, the CAS, or the
verification pipeline. Note that an issue does not necessarily represent errors/bugs in
the DLMF, CAS, or LaCASt (see the discussion about branch cuts in Section B). Figure 2
illustrates the pipeline of the evaluation engine. First, we analyze every equation in
the DLMF (hereafter referred to as test cases). A case analyzer splits multiple
relations in a single line into multiple test cases. Note that only the adjacent
relations are considered, i.e., with f(z) = g(z) = h(z), we generate two test cases
f(z) = g(z) and g(z) = h(z) but not f(z) = h(z). In addition, expressions with ± and ∓
are split accordingly, e.g., i^{±i} = e^{∓π/2} [10, (4.4.12)] is split into
i^{+i} = e^{−π/2} and i^{−i} = e^{+π/2}. The analyzer utilizes the attached additional
information in each line, i.e., the URL in the DLMF, the used and defined symbols, and
the constraints. If a used symbol is defined elsewhere in the DLMF, it performs
substitutions. For example, the multi-equation [10, (9.6.2)] is split into six test cases
and every ζ is replaced by (2/3)z^{3/2} as defined in [10, (9.6.1)]. The substitution is
performed on the parse tree of expressions [14]. A definition is only considered as
such if the defining symbol is identical to the equation’s left-hand side. That means
z = ((3/2)ζ)^{2/3} [10, (9.6.10)] is not considered as a definition for ζ. Further,
semantic macros are never substituted by their definitions. Translations for semantic
macros are exclusively defined by the authors. For example, the equation [10,
(11.5.2)] contains the Struve K_ν(z) function. Since Mathematica does not contain
this function, we defined an alternative translation to its definition H_ν(z) − Y_ν(z) in
[10, (11.2.5)] with the Struve function H_ν(z) and the Bessel function of the second
kind Y_ν(z), because both of these functions are supported by Mathematica. The
second entry in Table 3 in the Appendix D shows the translation for this test case.
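
The splitting behaviour of the case analyzer can be sketched on plain strings. This is only an illustration under simplifying assumptions: the real analyzer works on parse trees, while the hypothetical helpers `split_relations`, `split_plus_minus`, and `analyze` below use a regular expression and ASCII relation symbols.

```python
import re

# Relation symbols recognised by this sketch; longer symbols come first.
RELATION = re.compile(r"(<=|>=|!=|=|<|>)")


def split_relations(line):
    """Split a chain a = b = c into the adjacent relations a = b and b = c."""
    parts = [part.strip() for part in RELATION.split(line)]
    operands, relations = parts[0::2], parts[1::2]
    return [
        f"{operands[i]} {relations[i]} {operands[i + 1]}"
        for i in range(len(relations))
    ]


def split_plus_minus(case):
    """Expand ± and ∓ into the two sign-consistent variants."""
    if "±" not in case and "∓" not in case:
        return [case]
    upper = case.replace("±", "+").replace("∓", "-")
    lower = case.replace("±", "-").replace("∓", "+")
    return [upper, lower]


def analyze(line):
    """Return all test cases generated from one line."""
    return [case for relation in split_relations(line) for case in split_plus_minus(relation)]


print(analyze("f(z) = g(z) = h(z)"))
print(analyze("i^(±i) = e^(∓π/2)"))
```

A line without any relation symbol yields an empty list in this sketch, which corresponds to a skipped line.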


Next, the analyzer checks for additional constraints defined by the used symbols
recursively. The mentioned Struve K_ν(z) test case [10, (11.5.2)] contains the Gamma
function. Since the definition of the Gamma function [10, (5.2.1)] has a constraint
ℜz > 0, the numeric evaluation must respect this constraint too. For this purpose,
the case analyzer first tries to link the variables in constraints to the arguments
of the functions. For example, the constraint ℜz > 0 sets a constraint for the first
argument z of the Gamma function. Next, we check all arguments in the actual test
case at the same position. The test case contains Γ(ν+1/2). In turn, the variable z
in the constraint of the definition of the Gamma function ℜz > 0 is replaced by the
actual argument used in the test case. This adds the constraint ℜ(ν+1/2) > 0 to the
test case. This process is performed recursively. If a constraint does not contain any
variable that is used in the final test case, the constraint is dropped.
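
The linking of definition constraints to actual arguments might look roughly as follows. This is a minimal, non-recursive sketch: the table `DEFINITION_CONSTRAINTS`, the template syntax, and the function `inherited_constraints` are hypothetical, and the regular expression only handles calls without nested parentheses.

```python
import re

# Hypothetical lookup table: function name -> (formal parameters, constraint templates).
DEFINITION_CONSTRAINTS = {
    "Gamma": (("z",), ("Re({z}) > 0",)),
}

# Matches calls without nested parentheses, e.g. Gamma(nu + 1/2).
CALL = re.compile(r"(\w+)\(([^()]*)\)")


def inherited_constraints(expression):
    """Collect constraints inherited from the definitions of the used functions."""
    collected = []
    for name, raw_arguments in CALL.findall(expression):
        if name not in DEFINITION_CONSTRAINTS:
            continue
        formals, templates = DEFINITION_CONSTRAINTS[name]
        actuals = [argument.strip() for argument in raw_arguments.split(",")]
        # Replace each formal parameter by the argument at the same position.
        binding = dict(zip(formals, actuals))
        for template in templates:
            collected.append(template.format(**binding))
    return collected


print(inherited_constraints("H(nu, z) - Gamma(nu + 1/2)"))
```

A recursive version would apply the same step again to the functions that occur inside the substituted arguments and inside the definitions themselves.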


In total, the case analyzer would identify four additional constraints for the test
case [10, (11.5.2)]. Table 3 in the Appendix D shows the applied constraints (including
the directly attached constraint ℜz > 0 and the manually defined global constraints
from Figure 3). Note that the constraints may contain variables that do not appear
in the actual test case, such as ℜν+k+1 > 0. Such constraints do not have any effect
on the evaluation because if a constraint cannot be computed to true or false, the
constraint is ignored. Unfortunately, this recursive loading of additional constraints
may generate impossible conditions in certain cases, such as for |Γ(iy)| [10, (5.4.3)].
There are no valid real values of y such that ℜ(iy) > 0. In turn, every test value would
be filtered out, and the numeric evaluation would not verify the equation. However, such
cases are the minority and we were able to increase the number of correct evaluations
with this feature.


To avoid a large portion of incorrect calculations, the analyzer filters the dataset
before translating the test cases. We apply two filter rules to the case analyzer. First,
we filter expressions that do not contain any semantic macros. Due to the limitations
of LaCASt, these expressions most likely result in wrong translations. Further, it filters
out several meaningless expressions that are not verifiable, such as z = x in [10,
(4.2.4)]. The result dataset flags these cases with ‘Skipped - no semantic math’. Note
that the result dataset still contains the translations for these cases to provide a
complete picture of the DLMF. Second, we filter expressions that contain ellipsis¹⁷
(e.g., \cdots), approximations, and asymptotics (e.g., O(z²)) since those expressions
cannot be evaluated with the proposed approach. Further, a definition is skipped if it is
not a definition of a semantic macro, such as [10, (2.3.13)], because definitions without
an appropriate counterpart in the CAS are meaningless to evaluate. Definitions of
semantic macros, on the other hand, are of special interest and remain in the test set
since they allow us to test if a function in the CAS obeys the actual mathematical
definition in the DLMF. If the case analyzer (see Figure 2) is unable to detect a
relation, i.e., split an expression on <, ≤, ≥, >, =, or ≠, the line in the dataset is also
skipped because the evaluation approach relies on relations to test. After splitting
multi-equations (e.g., ±, ∓, a = b = c), filtering out all non-semantic expressions,
non-semantic macro definitions, ellipsis, approximations, and asymptotics, we end up
with 6,623 test cases in total from the entire DLMF.

17 Note that we filter out ellipsis (e.g., \cdots) but not single dots (e.g., \cdot).

After generating the test case with all constraints, we translate the expression to
the CAS representation. Every successfully translated test case is then symbolically
verified, i.e., the CAS tries to simplify the difference of an equation to zero.
Non-equation relations simplify to Booleans. Non-simplified expressions are verified
numerically for manually defined test values, i.e., we calculate actual numeric values
for both sides of an equation and check their equivalence.


4.1 Symbolic Evaluation 


The symbolic evaluation was performed for Maple as in [7]. However, we use the
newer version Maple 2020. Another feature we added to LaCASt is the support of
packages in Maple. Some functions are only available in modules (packages) that
must be preloaded, such as QPochhammer in the package QDifferenceEquations¹⁸.
The general simplify method in Maple does not cover q-hypergeometric functions.
Hence, whenever LaCASt loads functions from the q-hypergeometric package, the
better performing QSimplify method is used. With the WED and the new support for
Mathematica in LaCASt, we perform the symbolic and numeric tests for Mathematica
as well. The symbolic evaluation in Mathematica relies on the full simplification¹⁹. For
Maple and Mathematica, we defined the global assumptions x, y ∈ ℝ and k, n, m ∈ ℕ.
Constraints of test cases are added to their assumptions to support simplification.
Adding more global assumptions for symbolic computation generally harms the
performance since CAS internally use assumptions for simplifications. It turned
out that by adding more custom assumptions, the number of successfully simplified
expressions decreases.


4.2 Numerical Evaluation 


Defining an accurate test set of values to analyze an equivalence can be an arbitrarily
complex process. It would make sense that every expression is tested on specific values
according to the functions it contains. However, this laborious process is not suitable
for evaluating the entire DML and CAS. It makes more sense to develop a general set
of test values that (i) generally covers interesting domains and (ii) avoids singularities,
branch cuts, and similar problematic regions. Considering these two attributes, we
came up with the ten test points illustrated in Figure 3. The set contains four complex
values on the unit circle and six points on the real axis. The test values cover the
general area of interest (complex values in all four quadrants, negative and positive
real values) and avoid the typical singularities at {0, ±1, ±i}. In addition, several
variables are tied to specific values for entire sections. Hence, we applied additional
global constraints to the test cases.

18 https://jp.maplesoft.com/support/help/Maple/view.aspx?path=QDifferenceEquations/QPochhammer
[accessed 09/01/2021]
19 https://reference.wolfram.com/language/ref/FullSimplify.html [accessed 09/01/2021]


[Figure 3 omitted. Panels: "Test Values" (plot of the ten points in the complex plane),
"Special Test Values" (values drawn from {1, 2, 3}), and "Global Constraints".]
Fig. 3: The ten numeric test values in the complex plane for general variables. The
dashed line represents the unit circle |z| = 1. At the right, we show the set of values
for special variable values and general global constraints. On the right, i is referring
to a generic variable and not to the imaginary unit.

The numeric evaluation engine heavily relies on the performance of extracting free
variables from an expression. Unfortunately, the inbuilt functions in CAS, if available,
are not very reliable. As the authors explained in [7], a custom algorithm within Maple
was necessary to extract identifiers. Mathematica has the undocumented function
Reduce`FreeVariables for this purpose. However, both systems, the custom solution in
Maple and the inbuilt Mathematica function, have problems distinguishing free
variables of entire expressions from the bound variables in MEOMs, e.g., integration
and continuous variables. Mathematica sometimes does not extract a variable but
returns the unevaluated input instead. We regularly faced this issue for integrals.
However, we discovered one example without integrals. For EulerE[n,0] from [10,
(24.4.26)], we expected to extract {n} as the set of free variables but instead received
a set of the unevaluated expression itself {EulerE[n,0]}²⁰. Since the extended version
of LaCASt handles operators, including bound variables of MEOMs, we drop the use of
internal methods in CAS and extend LaCASt to extract identifiers from an expression.
During a translation process, LaCASt tags every single identifier as a variable, as long
as it is not an element of a MEOM. This simple approach proves to be very efficient
since it is implemented alongside the translation process itself and is already more
powerful as compared to the existing inbuilt CAS solutions. We defined subscripts of
identifiers as a part of the identifier, e.g., z₁ and z₂ are extracted as variables from
z₁ + z₂ rather than z.


The general pipeline for a numeric evaluation works as follows. First, we replace
all substitutions and extract the variables from the left- and right-hand sides of
the test expression via LaCASt. For the previously mentioned example of the Struve
function [10, (11.5.2)], LaCASt identifies two variables in the expression, ν and z.
According to the values in Figure 3, ν and z are set to the general ten values. A
numeric test contains every combination of test values for all variables. Hence, we
generate 100 test calculations for [10, (11.5.2)]. Afterward, we filter the test values
that violate the attached constraints. In the case of the Struve function, we end up
with 25 test calculations.


In addition, we apply a limit of 300 calculations for each test case and abort
a computation after 30 seconds due to computational limitations. If the test case
generates more than 300 test values, only the first 300 are used. Finally, we calculate
the result for every remaining test value, i.e., we replace every variable by its value
and calculate the result. The replacement is done by Mathematica’s ReplaceAll
method because the more appropriate method With, for unknown reasons, does not
always replace all variables by their values. We wrap test expressions in Normal
for numeric evaluations to avoid conditional expressions, which may cause incorrect
calculations (see Section 5.1 for a more detailed discussion of conditional outputs).
After replacing variables by their values, we trigger numeric computation. If the
absolute value of the result (i.e., the difference between left- and right-hand side of the
equation) is below the defined threshold of 0.001 or the result is true (in the case of
inequalities), the test calculation is considered successful. A numeric test case is
considered successful if and only if every test calculation was successful. If a numeric
test case fails, we store the information on which values it failed and how many of
these were successful.

20 The bug was reported to and confirmed by Wolfram Research (Version 12.0).
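
The numeric test loop described in this section can be sketched as follows. Several points are assumptions: the concrete `TEST_VALUES` are illustrative stand-ins for the ten points of Figure 3, the limit of 300 is applied after constraint filtering, the 30-second timeout and Boolean results for inequalities are not modelled, and `numeric_test` is a hypothetical name.

```python
import cmath
import itertools

THRESHOLD = 0.001        # maximal absolute difference for a successful calculation
MAX_CALCULATIONS = 300   # limit of calculations per test case

# Illustrative stand-ins: six real points and four points on the unit circle.
TEST_VALUES = [-2, -1.5, -0.5, 0.5, 1.5, 2] + [
    cmath.exp(1j * cmath.pi * k / 6) for k in (1, 5, -5, -1)
]


def numeric_test(difference, variables, values=TEST_VALUES, constraints=()):
    """Evaluate `difference` (lhs - rhs) on all valid value combinations."""
    combinations = itertools.product(values, repeat=len(variables))
    assignments = (dict(zip(variables, combo)) for combo in combinations)
    # Drop combinations that violate any attached constraint.
    valid = (a for a in assignments if all(check(a) for check in constraints))
    tested = list(itertools.islice(valid, MAX_CALCULATIONS))
    failed = [a for a in tested if abs(difference(**a)) >= THRESHOLD]
    # A test case succeeds only if every single calculation succeeds.
    return {"tested": len(tested), "failed": failed, "success": bool(tested) and not failed}


result = numeric_test(
    lambda x, y: (x + y) ** 2 - (x * x + 2 * x * y + y * y),
    variables=("x", "y"),
    constraints=[lambda a: a["x"].real > 0],
)
print(result["tested"], result["success"])
```

Storing the failed assignments makes it possible to distinguish partial failures from cases in which every test value failed.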


5 Results 


The translations to Maple and Mathematica, the symbolic results, the numeric
computations, and an overview PDF of the reported bugs to Mathematica are available
online on our demopage. In the following, we mainly focus on Mathematica because
of page limitations and because Maple has been investigated more closely by [7]. The
results for Maple are also available online. Compared to the baseline (~31%), our
improvements doubled the number of translations (62%) for Maple and reach ~71%
for Mathematica. The majority of expressions that cannot be translated contain
macros that have no adequate translation pattern to the CAS, such as the macros for
interval Weierstrass lattice roots [10, §23.3(i)] and the multivariate hypergeometric
function [10, (19.16.9)]. Other errors (6% for Maple and Mathematica) occur for
several reasons. For example, out of the 418 errors in translations to Mathematica,
130 caused an error because the MEOM of an operator could not be extracted, 86
contained prime notations that do not refer to differentiations, 92 failed because of
the underlying LaTeX parser [34], and in 46 cases, the arguments of a DLMF macro
could not be extracted.

Out of 4,713 translated expressions, 1,235 (26.2%) were successfully simplified
by Mathematica (1,084 of 4,114 or 26.3% in Maple). For Mathematica, we also
count results that are equal to 0 under certain conditions as successful (called
ConditionalExpression). We identified 65 of these conditional results: 15 of the
conditions are equal to constraints that were provided in the surrounding text but
not in the info box of the DLMF equation; 30 were produced due to branch cut
issues (see Section B in the Appendix); and 20 were the same as attached in the
DLMF but reformulated, e.g., z ∈ ℂ \ (1, ∞) from [10, (25.12.2)] was reformulated to
ℑz ≠ 0 ∨ ℜz < 1. The remaining translated but not symbolically verified expressions
were numerically evaluated for the test values in Figure 3. Of the 3,474 cases, 784
(22.6%) were successfully verified numerically by Mathematica (698 of 2,618 or 26.7%
by Maple²¹). For 1,784 the numeric evaluation failed. In the evaluation process, 655
computations timed out and 180 failed due to errors in Mathematica. Of the 1,784
failed cases, 691 failed partially, i.e., there was at least one successful calculation
among the tested values. For 1,091 all test values failed. Table 3 in the Appendix D
shows the results for three sample test cases. The first case is a false positive evaluation
because of a wrong translation. The second case is valid, but the numeric evaluation
failed due to a bug in Mathematica (see next subsection). The last example is valid
and was verified numerically but was too complex for symbolic verifications.

21 Due to computational issues, 120 cases had to be skipped manually. 292 cases
resulted in an error during symbolic verification and, therefore, were also skipped for
numeric evaluations. Considering these skipped cases as failures decreases the
numerically verified cases to 23% in Maple.

5.1 Error Analysis 


The numeric tests’ performance strongly depends on correctly attached and utilized
information. The first example in Table 3 in the Appendix D illustrates the difficulty
of the task on a relatively easy case. Here, the argument of f was not explicitly
given, such as in f(a). Hence, LaCASt translated f as a variable. Unfortunately, this
resulted in a false verification symbolically and numerically. This type of error mostly
appears in the first three chapters of the DLMF because they use generic functions
frequently. We hoped to skip such cases by filtering expressions without semantic
macros. Unfortunately, this derivative notation uses the semantic macro deriv. In
the future, we will filter expressions that contain semantic macros that are not linked
to a special function or orthogonal polynomial.

As an attempt to investigate the reliability of the numeric test pipeline, we can run
numeric evaluations on symbolically verified test cases. Since Mathematica already
approved a translation symbolically, the numeric test should be successful if the
pipeline is reliable. Of the 1,235 symbolically successful tests, only 94 (7.6%) failed
numerically. None of the failed test cases failed entirely, i.e., for every test case, at
least one test value was verified. Manually investigating the failed cases reveals 74
cases that failed due to an Indeterminate response from Mathematica and 5 that
returned infinity, which clearly indicates that the tested numeric values were invalid,
e.g., due to testing on singularities. Of the remaining 15 cases, two were identical: [10,
(15.9.2)] and [10, (18.5.9)]. This reduces the remaining failed cases to 14. We evaluated
invalid values for 12 of these because the constraints for the values were given in
the surrounding text but not in the info boxes. The remaining 2 cases revealed a
bug in Mathematica regarding conditional outputs (see below). The results indicate
that the numeric test pipeline is reliable, at least for relatively simple cases that
were previously symbolically verified. The main reasons for the high number of failed
numerical cases in the entire DLMF (1,784) are missing constraints in the
i-boxes and branch cut issues (see Section B in the Appendix), i.e., we evaluated
expressions on invalid values.


Bug reports Mathematica has trouble with certain integrals, which, by default,
generate conditional outputs if applicable. With the method Normal, we can suppress
conditional outputs. However, it only hides the condition rather than evaluating
the expression to a non-conditional output. For example, integral expressions in [10,
(10.9.1)] are automatically evaluated to the Bessel function J₀(|z|) for the condition²²
z ∈ ℝ rather than J₀(z) for all z ∈ ℂ. Setting the GenerateConditions²³ option
to None does not change the output. Normal only hides z ∈ ℝ but still returns
J₀(|z|). To fix this issue, for example in (10.9.1) and (10.9.4), we are forced to set
GenerateConditions to false.

22 J₀(x) with x ∈ ℝ is even. Hence, J₀(|z|) is correct under the given condition.
23 https://reference.wolfram.com/language/ref/GenerateConditions.html [accessed
09/01/2021]

Setting GenerateConditions to false, on the other hand, reveals severe errors
in several other cases. Consider $\int_z^\infty t^{-1} e^{-t}\,dt$ [10, (8.4.4)], which gets
evaluated to Γ(0,z) but (condition) for ℜz > 0 ∧ ℑz = 0. With GenerateConditions set
to false, the integral incorrectly evaluates to Γ(0,z) + ln(z). This happened with the 2
cases mentioned above. With the same setting, the difference of the left- and right-hand
sides of [10, (10.43.8)] is evaluated to 0.398942 for x, ν = 1.5. If we evaluate the
same expression on x, ν = 3, the result is Indeterminate due to infinity. For
this issue, one may use NIntegrate rather than Integrate to compute the integral.
However, evaluating via NIntegrate decreases the number of successful numeric
evaluations in general. We have revealed errors with conditional outputs in (8.4.4),
(10.22.39), (10.43.8-10), and (11.5.2) (in [10]). In addition, we identified one critical
error in Mathematica. For [10, (18.17.47)], WED (Mathematica’s kernel) ran into
a segmentation fault (core dumped) for n > 1. The kernel of the full version of
Mathematica gracefully died without returning an output²⁴.

Besides Mathematica, we also identified several issues in the DLMF. None of the
newly identified issues were as critical as the sign error reported in the previous
project [7]; they generally refer to missing or wrongly attached semantic information.
With the generated results, we can effectively fix these errors and further semantically
enhance the DLMF. For example, some definitions are not marked as such, e.g.,
$Q(z) = \int_0^\infty e^{-zt} q(t)\,dt$ [10, (2.4.2)]. In [10, (10.24.4)], ν must be a real
value but was linked to a complex parameter and x should be positive real. An entire
group of cases [10, (10.19.10-11)] also revealed the incorrect use of semantic macros. In
these formulae, P_k(a) and Q_k(a) are defined but had been incorrectly marked up as
Legendre functions going all the way back to DLMF Version 1.0.0 (May 7, 2010). In
some cases, equations are mistakenly marked as definitions, e.g., [10, (9.10.10)] and
[10, (9.13.1)] are annotated as local definitions of n. We also identified an error in
LaCASt, which incorrectly translated the exponential integrals E₁(z), Ei(x), and Ein(z)
(defined in [10, §6.2(i)]). A more explanatory overview of discovered, reported, and
fixed issues in the DLMF, Mathematica, and Maple is provided in the Appendix C.


6 Conclusion 


We have presented a novel approach to verify the theoretical digital mathematical
library DLMF with the power of two major general-purpose computer algebra systems
Maple and Mathematica. With LaCASt, we transformed the semantically enhanced
LaTeX expressions from the DLMF to each CAS. Afterward, we symbolically and
numerically evaluated the DLMF expressions in each CAS. Our results are auspicious
and provide useful information to maintain and extend the DLMF efficiently. We
further identified several errors in Mathematica, Maple [7], the DLMF, and the
transformation tool LaCASt, demonstrating the benefit of the presented verification
approach. Further, we provide open access to all results, including translations and
evaluations²⁵, and to the source code of LaCASt²⁶.

24 All errors were reported to and partially confirmed by Wolfram Research. See Appendix C
for more information.

The presented results show a promising step towards an answer for our initial 
research question. By translating an equation from a DML to a CAS, automatic
verifications of that equation in the CAS allow us to detect issues in either the DML
source or the CAS implementation. Each analyzed failed verification successively 
improves the DML or the CAS. Further, analyzing a large number of equations from 
the DML may be used to finally verify a CAS. In addition, the approach can be 
extended to cover other DML and CAS by exploiting different translation approaches, 
e.g., via MATHML [31] or OpenMath [18]. 

Nonetheless, the analysis of the results, especially for an entire DML, is cumbersome. 
Minor missing semantic information, e.g., a missing constraint or disregarded 
branch cut positions, leads to a relatively large number of false positives, i.e., unverified 
expressions that are correct in both the DML and the CAS. This makes a generalization of the 
approach challenging because all semantics of an equation must be taken into account 
for a trustworthy evaluation. Furthermore, evaluating equations on a small number 
of discrete values will never provide sufficient confidence to verify a formula, which 
leads to an unpredictable number of false negatives, i.e., erroneous equations that 
pass all tests. A more sophisticated selection of critical values or other numeric tools 
with automatic result verification (such as variants of Newton’s interval method) 
could mitigate this issue in the future. After all, we conclude that the approach 
provides valuable information to complement, improve, and maintain the DLMF, 
Maple, and Mathematica. A trustworthy verification, on the other hand, might be 
out of reach. 
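
The limitation of testing on a few discrete values can be illustrated with a small Python sketch. It is not taken from the translation tool discussed in the paper; the function numerically_equal, the tolerance, and the test values are assumptions chosen for illustration. The candidate equation sqrt(x^2) = x is only valid for x >= 0, so it passes when all test values happen to be positive and fails once negative values are included.

```python
import math

def numerically_equal(lhs, rhs, test_values, tol=1e-9):
    """Return the test values at which lhs and rhs differ by more than tol."""
    failures = []
    for x in test_values:
        if abs(lhs(x) - rhs(x)) > tol:
            failures.append(x)
    return failures

# Candidate equation sqrt(x^2) = x, which is missing the constraint x >= 0.
lhs = lambda x: math.sqrt(x * x)
rhs = lambda x: x

positive_only = [0.5, 1.0, 1.5, 2.0]      # hypothetical test values
with_negatives = [-2.0, -0.5, 0.5, 2.0]   # hypothetical test values

print(numerically_equal(lhs, rhs, positive_only))   # [] : the error goes unnoticed
print(numerically_equal(lhs, rhs, with_negatives))  # [-2.0, -0.5]
```

An empty failure list therefore only says that no counterexample was found among the chosen values, which is why the selection of test values matters.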


6.1 Future Work 


The resulting dataset provides valuable information about the differences between 
CAS and the DLMF. These differences have not been studied extensively in the past 
and are worthy of analysis. In particular, a comprehensive and machine-readable list 
of branch cut positioning in different systems is a desired goal [9]. Hence, we will 
continue to work closely together with the editors of the DLMF to further improve 
and expand the available information on the DLMF. Finally, the numeric evaluation 
approach would benefit from test values dependent on the actual functions involved. 
For example, the current layout of the test values was designed to avoid problematic 
regions, such as branch cuts. However, for identifying differences between the DLMF and 
CAS, especially for analyzing the positioning of branch cuts, an automatic evaluation 
of these particular values would be very beneficial and could be used to collect a 
comprehensive, inter-system library of branch cuts. Therefore, we will further study 
the possibility of linking semantic macros with numeric regions of interest. 

Acknowledgements We thank Jürgen Gerhard from Maplesoft for providing access and 
support for Maple. We also thank the DLMF editors for their assistance and support. This 

Abstract. Fortran is widely used in computational science, engineering, 
and high performance computing. This paper presents an extension 
to the CIVL verification framework to check correctness properties of 
Fortran programs. Unlike previous work that translates Fortran to C, 
LLVM IR, or other intermediate formats before verification, our work 
allows CIVL to directly consume Fortran source files. We extended the 
parsing, translation, and analysis phases to support Fortran-specific features 
such as array slicing and reshaping, and to find program violations 
that are specific to Fortran, such as argument aliasing rule violations, invalid 
use of variable and function attributes, or defects due to Fortran’s 
unspecified expression evaluation order. We demonstrate the usefulness 
of our tool on a verification benchmark suite and kernels extracted from 
a real-world application. 


Keywords: Fortran · verification · static analysis · model checking 


1 Introduction 


Fortran is a structured imperative programming language with a unique set of 
features, such as common data blocks and array reshaping and sectioning, that 
support efficient numerical computing. Many scientific applications, especially 
those requiring high performance, are written entirely in Fortran; others have 
core subroutines or rely on external components written in Fortran. A 2018 
report from the European Performance Optimisation and Productivity Centre 
states that over half of the 151 HPC programs the centre had analyzed over a 
two-year period were written in pure Fortran or a combination of Fortran and 
C or C++ [10]. Likewise, 12 of the 33 HPC benchmark applications in the U.S. 
Department of Energy’s widely-used CORAL suite have components written in 
Fortran [17]. 

The Fortran language has been used and revised for decades, and it has had 
many standard versions. Early versions of Fortran employed a fixed-form coding 
style, well-suited for punch cards and with strict positional constraints. Beginning 
with Fortran 90, a free-form style was introduced, enabling more structured 
programs, eliminating limits on line lengths, and providing more flexibility with 
character positioning by removing the restriction that the first six columns could 
be used only for labels and continuation characters. Modern Fortran programs 
tend to use the free-form style, but programs derived from a Fortran 77 predecessor 
or relying on legacy components may use fixed-form style or a mix of 
both styles. 

Fortran is used to implement applications such as Nek5000 [21] or Flash 
[32] that are used for critical tasks such as nuclear reactor licensing reviews or 
to answer important scientific questions. These applications are often computationally 
demanding, requiring hours of computation on millions of execution 
units. Because of the critical importance and high resource requirements of these 
applications, one would like to verify their correctness. 


The Fortran language itself provides little support for verification—not even 
assertions. Compilers can check certain simple syntactic and semantic properties, 
and static analyzers such as Coverity [35] can detect standard violations and 
other anomalies. But there are very few tools that can be used to specify and 
verify deeper functional correctness properties of programs, and nothing like the 
rich ecosystem of formal verification tools for C. 


One might approach Fortran program verification by using a source-to-source 
translator such as f2c [11] to convert to C and then applying a C verifier. Unfortunately, 
even if the translator provides a completely valid translation, defects 
in the original code may not be preserved in the translated code; an example 
is given in Section 3.1. In addition, the C verifier may not be able to access 
translator support libraries, or defects that manifest themselves via the library 
may be difficult to map back to the original program. 


A second approach is to use a compiler front end to convert Fortran code into 
an intermediate form such as the LLVM [16] Intermediate Representation (IR), 
and then apply a verifier for the IR. This is more difficult than it appears: most 
verifiers that consume LLVM IR are tuned to a specific source language and front 
end and cannot be easily modified to effectively verify multiple languages. This 
issue is explored in [13] in the case of SMACK, a C-via-LLVM verifier that has 
been extended to provide limited support for other languages, including Fortran. 
Moreover, as with source-to-source translators, the front end may translate away 
a defect in the original program; this is discussed in Section 3.2. 

In this paper, we present an approach to extending the CIVL [33] verification 
framework so that it can be directly applied to Fortran source code. CIVL 
is a model checker that uses symbolic execution to verify correctness properties 
and was originally designed for programs written in C with a set of parallel 
programming language extensions such as OpenMP [26]. In our extended framework, 
summarized in Section 2, a new Fortran front end with a static analyzer 
has been integrated into the system. In Section 3 we describe the sequence of 
defect-preserving transformations that convert the Fortran source to the CIVL 
intermediate verification language, CIVL-C. Proper handling of arrays is a special 
concern, discussed in Section 4. The Fortran extension supports a subset of 
the major features defined in the language standard, focused on those features 
necessary to verify code excerpts from real-world applications. 

In Section 5, we evaluate our approach by verifying several examples of Fortran 
code, including (1) a custom Fortran benchmark suite designed to test 
CIVL’s ability to verify programs using unique Fortran features such as array 
slicing and reshaping, (2) a published micro verification benchmark [13], and 
(3) a set of code excerpts from Nek5000 [21]. The evaluation employs both of 
CIVL’s verification modes on Fortran programs. The first uses assumptions and 
assertions inserted in the program to specify the desired correctness properties. 
The second compares two programs with the same input-output signatures to 
determine whether they are functionally equivalent. 

Related work is discussed in Section 6, and conclusions and future work are 
summarized in Section 7. 


2 Overview of CIVL Extension 


The Concurrency Intermediate Verification Language (CIVL) platform was developed 
to verify C programs that use various concurrency language extensions 
[33]. CIVL has two primary components: a front end and a back end verifier. 
The front end consumes a set of source files, which, prior to this work, had to be 
written in C or CUDA-C, possibly using certain CIVL extensions to C. These 
source files may use one or more concurrency language extensions, including 
MPI [18], OpenMP [26], Pthreads [25], and CUDA-C [24]. The input is parsed, 
analyzed and merged to create a single abstract syntax tree (AST) representing 
the whole program. This AST then undergoes a sequence of transformations to 
replace all of the concurrency primitives with equivalent CIVL-C primitives, and 
to simplify the AST in other ways, resulting in a “pure” CIVL-C AST. 


The back end first converts the pure AST to a lower-level representation in 
which each procedure is represented as a program graph. A node in this graph 
represents a program counter value, i.e., a location in the procedure body. An 
edge represents an atomic transition, and is decorated with a guard expression 
that specifies when the transition is enabled, and a basic statement, such as 
an assignment. The verifier then performs an explicit enumeration of the reachable 
states of the program (“model checking”). This is carried out by depth-first 
search, while saving the seen states in a hash table. Each state maps variables 
to symbolic expressions and includes a path condition, a symbolic expression of 
boolean type that records the guards that held along the explored path (“symbolic 
execution”). An interleaving model of concurrency is used, and processes 
can be created and destroyed dynamically. 
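
The search strategy described above can be sketched in Python. This is a deliberately simplified illustration and not CIVL code: states hold concrete values instead of symbolic expressions and path conditions, there is a single process, and the names explore and transitions are hypothetical. What it does show is the combination of guarded transitions, depth-first exploration, and a hash-based set of seen states.

```python
def explore(initial_state, transitions):
    """Depth-first enumeration of reachable states with a set of seen states.

    transitions is a list of (guard, update) pairs: guard(state) tells whether
    the transition is enabled, update(state) returns the successor state.
    States must be hashable so they can be stored in the seen-set.
    """
    seen = {initial_state}
    stack = [initial_state]
    while stack:
        state = stack.pop()
        for guard, update in transitions:
            if guard(state):
                successor = update(state)
                if successor not in seen:
                    seen.add(successor)
                    stack.append(successor)
    return seen

# Toy program graph with states (pc, x): location 0 increments x while x < 3,
# then moves to location 1.
transitions = [
    (lambda s: s[0] == 0 and s[1] < 3, lambda s: (0, s[1] + 1)),
    (lambda s: s[0] == 0 and s[1] >= 3, lambda s: (1, s[1])),
]

reachable = explore((0, 0), transitions)
print(sorted(reachable))  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3)]
```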

During the search, automated theorem provers are invoked to determine 
whether the path condition has become unsatisfiable (in which case the search 
backtracks) and to check assertions. CIVL checks both explicit assertions appearing 
in the program and implicit assertions (a divisor is not 0, a pointer 
dereference is valid, and so on). The supported provers include Z3 [20], CVC4 [4], 
Why3 [5], and a number of additional provers invoked by Why3. 

[Figure 1: architecture diagram. Recoverable labels: sources, C parse, Fortran parse.] 


Fig. 1. CIVL architecture: front end (top) and back end (bottom) 


Figure 1 shows the tool prior to this work and highlights the extensions 
developed as part of this paper. Modifications were made to both the front and 
back end to enable the direct application of CIVL to Fortran source code. 

The CIVL preprocessor was generalized to accept a superset of C and Fortran: 
it is common practice to use C preprocessor directives in Fortran programs, 
and Fortran compilers can invoke the preprocessor as a first pass. The tokens 
emanating from the preprocessor have a type specific to the source language (C 
or Fortran), which is determined by the file suffix or a command line option. It 
is possible to invoke CIVL on a mix of C and Fortran source files; each will 
be preprocessed separately and yield a separate stream of tokens in the correct 
language. 

Each Fortran token stream enters the Fortran parser. This was produced by 
the parser generator ANTLR [28] using a grammar derived from the Open Fortran 
Project (OFP) [31]. We extended the grammar by adding support for CIVL 
primitives, such as assertions and assumptions, which can appear as structured 
comments in the Fortran source. The parser produces a parse tree, which is then 
converted to a CIVL-C AST. Each C token stream follows a similar path, and 
also results in a CIVL-C AST. Finally, the individual ASTs, together with ASTs 
generated from any libraries, are merged into a single AST, analogous to the 
linking phase in a standard compilation flow. The supported Fortran subset is 
listed below: 


— program units: main programs, subroutines, and functions 

— statements: allocate, assignment, call, computed goto, data, dimension, do, 
exit, goto, if, implicit, intent, parameter, pointer assignment, print, return, 
stop, target, type declaration, and write 

— expressions: variable references, function calls, operators for scalar types 

— intrinsic functions: mod, max, abs, sin, cos, atan, and sqrt 

— extended features: CIVL preprocessor directives and CIVL primitives. 


The transformation from a Fortran parse tree to a CIVL-C AST is quite 
involved because the languages differ substantially. In almost every case, we were 
able to find a way to represent a Fortran statement, in a semantics-preserving 
and defect-preserving way, using existing CIVL-C AST nodes; in a few cases 
we had to add new fields to the AST node. Issues include the Fortran “intent” 
specification for a procedure parameter (in, out, or in/out); pass-by-reference 
semantics; and advanced array operations. Details for some of these translations 
are described in Section 3. 

The verifier was also upgraded to check specific Fortran runtime constraints 
during state exploration. For example, the verifier normally uses short-circuit 
semantics for evaluating logical “and” and “or” expressions. This is appropriate for C, but 
Fortran does not mandate short-circuiting or the order in which subexpressions 
are evaluated. Since evaluation can result in an error, a verifier that assumes 
short-circuiting semantics could miss defects in a Fortran program. By default, 
our modified verifier turns off short-circuiting for Fortran code. 


3 Defect-Preserving Translation 


When translation is used for verification, it is crucial that all translation phases preserve defects. 
This is in contrast to translation and lowering phases in a compiler, which 
generally are allowed to narrow the semantics of a program or choose arbitrarily 
from multiple interpretations. In this section, we first demonstrate with small 
examples that an approach relying on existing source-to-source translation tools 
such as f2c, or compiler front ends such as Flang, is bound to miss certain defects 
in the Fortran input, since these defects are removed by these tools. One 
might be tempted to argue that defects which disappear during translation or 
compilation are not really important. However, these defects are still present in 
the original source code and may manifest themselves when a different compiler 
is used or when other seemingly innocent changes are made to the code or the 
translation/compilation tool chain. 


3.1 Translation from Source to Source 


Figure 2 shows a procedure in Fortran 77 and its C translation produced by f2c. 
The example extracts a value x from an array at the given index, and computes 
max(x, 0). An array bounds check is performed in the same boolean expression in 
which the array is accessed. The C code is certainly valid, because it uses short-circuiting 
when evaluating logical expressions. Thus, the evaluation of the second 
part of the boolean expression is skipped if the first part is false. Fortran, on the 
other hand, does not define the order in which the subexpressions are evaluated, 


    Fortran:

        if (idx .le. size_arr .and.
            arr(idx) .ge. 0) then
          relu = arr(idx)
        else
          relu = 0
        end if

    C translation produced by f2c:

        /* Function Body */
        if (*idx <= *arr_size__ && arr[*idx] >= 0.f) {
            *relu = arr[*idx];
        } else {
            *relu = 0.f;
        }


Fig. 2. Applying f2c to the Fortran .and. operator removes a defect 
and the compiler may choose an evaluation order that causes an out-of-bounds 
access in the second half of the expression. The implementation-defined order 
chosen by f2c happens to remove this defect during translation, which makes 
it difficult to detect for a verifier that is only provided with the C program. 
Nevertheless, the Fortran program may break when a different translator or 
compiler tool chain is used to execute it. 
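
The difference between the two evaluation strategies can be reproduced with a short Python sketch. It is not derived from f2c or CIVL; the helper names fortran_get, relu_short_circuit, and relu_full_evaluation are hypothetical, and an IndexError stands in for the out-of-bounds defect. The first function mirrors the short-circuit behavior of the translated C code, the second evaluates both operands before combining them, which a Fortran compiler is permitted to do.

```python
def fortran_get(arr, idx):
    """Bounds-checked element access with 1-based indexing."""
    if not 1 <= idx <= len(arr):
        raise IndexError(f"index {idx} outside 1..{len(arr)}")
    return arr[idx - 1]

def relu_short_circuit(arr, idx):
    # The array access is evaluated only if the bounds check succeeds.
    if idx <= len(arr) and fortran_get(arr, idx) >= 0:
        return fortran_get(arr, idx)
    return 0.0

def relu_full_evaluation(arr, idx):
    # Both operands are evaluated before they are combined.
    in_range = idx <= len(arr)
    non_negative = fortran_get(arr, idx) >= 0
    if in_range and non_negative:
        return fortran_get(arr, idx)
    return 0.0

arr = [1.0, -2.0, 3.0]
print(relu_short_circuit(arr, 5))  # 0.0, the defect stays hidden
try:
    relu_full_evaluation(arr, 5)
except IndexError as err:
    print("defect exposed:", err)
```

For in-range indices both functions agree; they differ only on inputs where the second operand is itself erroneous, which is exactly the class of defect a verifier of the translated code cannot see.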

Besides the lack of defect preservation, there are other drawbacks when using 
a source-to-source converter, including the fact that some of them introduce 
hard-to-verify external headers or libraries to simulate Fortran behaviors. Further, 
when verifying translated code, source file information (e.g., file and identifier 
names, code locations, etc.) can be harder to communicate to the user, and 
translation tools may actually introduce new errors, leading to another potential 
source of unreliable verification results. 


3.2 Translation for Compilation 


A popular approach for verifying source code is to build a verifier based on a 
mature compiler tool chain (e.g., LLVM [16]). This allows verification researchers 
to spend more of their time on research and less time on maintaining language 
front ends, and allows robust support of a variety of languages. We argue that 
such an approach, while also very valuable, achieves a different outcome than 
what we present in our work. Compiler front ends such as Clang or Flang are 
not developed with the goal of preserving defects, and defective programs may 
be lowered into correct LLVM intermediate representation (IR). Furthermore, 
the compiler may in rare cases introduce new defects due to compiler bugs. In 
the absence of such compiler bugs, verification based on the IR will ensure that 
the input program is correct if compiled with the same compiler and settings that 
were used for verification. With our approach, we instead aim to verify that a 
program adheres to the language standard. 

Figure 3 shows the LLVM IR produced by Flang (version 1.5 2017-05-01) for 
the Fortran code snippet in Figure 2. Similar to the case with f2c, it first checks 
the array bounds by comparing %15 (element index) with %17 (array size). If 
the index is out of bounds (i.e., %18 is evaluated as true), then the control flow 
skips the block that accesses the array elements, and the second subexpression 


    L.LB1_339:                                   ; preds = %L.entry
      %14 = bitcast i64* %idx to i32*, !dbg !18
      %15 = load i32, i32* %14, align 4, !dbg !18
      %16 = bitcast i64* %arr_size to i32*, !dbg !18
      %17 = load i32, i32* %16, align 4, !dbg !18
      %18 = icmp sgt i32 %15, %17, !dbg !18
      br i1 %18, label %L.LB1_313, label %L.LB1_349, !dbg !18
      ...
    L.LB1_313:                                   ; preds = %L.LB1_349, %L.LB1_339
      %41 = bitcast i64* %relu to float*, !dbg !21
      store float 0.000000e+00, float* %41, align 4, !dbg !21
      br label %L.LB1_314


Fig. 3. Result of applying Flang to Fortran code of Figure 2 


    Non-conforming Fortran routine:

        subroutine intent_bad(i)
          integer, intent(out) :: i
          i = i + 1
        end subroutine

    Conforming Fortran routine:

        subroutine intent_good(i)
          integer, intent(inout) :: i
          i = i + 1
        end subroutine

    CIVL translation of the non-conforming routine:

        void INCR(int* __OUT_I) {
          int I;
          I = I + 1;
          *__OUT_I = I;
        }


Fig. 4. Fortran routine that fails to conform to specified intent; one that conforms; and 
CIVL translation of the non-conforming code 


in the condition expression is omitted. This means that the defect in the original 
Fortran code is undetectable in the IR. 

Figure 4 is another case where Flang translates an incorrect program into 
valid IR. A Fortran subroutine may use the INTENT attribute in an illegal way, 
for example by declaring an argument as INTENT(OUT) and subsequently reading 
from it. This is problematic since the value of such a variable is undefined at 
the entry of the subroutine, even if it was initialized in the caller. Flang nevertheless 
generates identical LLVM IR for the two subroutines, the first of which 
violates the Fortran standard, the second of which declares the same argument 
as INTENT(INOUT) and hence correctly passes the variable into and out of the 
subroutine. 


3.3 Translation for Verification 


Based on these observations, we extended CIVL with a front end to translate 
Fortran to CIVL-C ASTs in a way that is designed to preserve defects. The 
front end avoids AST simplifications and optimizations that may introduce or 
remove defects, or that may hide violations of the Fortran language standard. 
The short-circuit evaluation of logical expressions is disabled by default when verifying 
Fortran source. When processing the code of Figure 2, the CIVL-C AST 
builder thus keeps both subexpressions in the condition, and all parts of the expression 
are evaluated in the verification phase. The model checker consequently 
reports an out-of-bounds access in the array (see footnote 3). 

3 It is also possible (however unlikely) that a defect may manifest only when short-circuiting is 
enabled. A strictly conservative solution could use nondeterministic choice to decide, at each 
logical expression, whether to short-circuit. We plan to add such an option to CIVL. 

We also developed a static analyzer to detect certain defects before the program 
is even translated to a CIVL-C AST. The analyzer mainly checks constraints 
on variable attributes or procedure specifications. For example, variables 
in Fortran may have the ALLOCATABLE, POINTER, or TARGET attribute. It is legal 
to pointer-assign a variable with the POINTER attribute to a variable with 
the TARGET or POINTER attribute, but not to a variable without any of these 
attributes. Both sides of each pointer assignment are statically checked for required 
attributes by the analyzer. When all constraints of a specific attribute 
are verified for each associated variable, that attribute information is not passed 
to the model checker. Similarly, a subroutine or function is only allowed to be 
recursively called if it has the RECURSIVE attribute, and our analyzer checks this 
by searching for cycles in the call graph and checking whether the subroutines or functions 
that are part of a cycle have the required attribute. As a result, this kind of 
constraint is checked by the analyzer and it is not necessary to include certain 
attributes in the CIVL-C AST. 
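
The recursion check lends itself to a small illustration. The following Python sketch is an assumption about how such a check could look, not the analyzer's actual implementation; the function find_illegal_recursion and the dictionary-based call graph are hypothetical. A procedure is reported if it can reach itself through the call graph but has not been declared RECURSIVE.

```python
def find_illegal_recursion(call_graph, recursive):
    """Return procedures on a call-graph cycle that lack the RECURSIVE attribute.

    call_graph maps each procedure name to the names it calls;
    recursive is the set of procedures declared RECURSIVE.
    """
    def reaches_itself(start):
        stack = list(call_graph.get(start, ()))
        visited = set()
        while stack:
            proc = stack.pop()
            if proc == start:
                return True
            if proc not in visited:
                visited.add(proc)
                stack.extend(call_graph.get(proc, ()))
        return False

    return sorted(p for p in call_graph
                  if reaches_itself(p) and p not in recursive)

call_graph = {
    "main": ["fact", "ping"],
    "fact": ["fact"],   # direct recursion, declared RECURSIVE below
    "ping": ["pong"],   # mutual recursion, not declared
    "pong": ["ping"],
}
print(find_illegal_recursion(call_graph, recursive={"fact"}))  # ['ping', 'pong']
```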

The defect-preserving translation is mainly performed by the Fortran AST 
builder shown in Figure 1. For properties that cannot be verified by the analyzer, 
the translation phase inserts auxiliary structures into the CIVL-C AST for verification 
in a later phase. For example, the subroutines in Figure 4 have distinct 
CIVL-C AST structures. A formal parameter with the INTENT(OUT) attribute is 
initialized with a value representing “undefined.” This allows the model checker 
to find and report a violation (reading an undefined value) during the transition 
executing the assignment statement. The CIVL translation of the incorrect 
routine is shown in Figure 4. 
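
The idea of an explicit “undefined” value can be mimicked in Python. The sketch below is only an analogy and makes no claim about CIVL's internal representation; the Cell class, UndefinedValueError, and the increment function are hypothetical names. A dummy argument with intent "out" starts undefined inside the routine, so the read in i = i + 1 is flagged, whereas intent "inout" starts with the caller's value.

```python
_UNDEFINED = object()  # sentinel standing in for an undefined value

class UndefinedValueError(Exception):
    """Raised when a variable is read before it has been assigned."""

class Cell:
    """A variable that tracks whether it currently holds a defined value."""
    def __init__(self, value=_UNDEFINED):
        self._value = value

    def read(self):
        if self._value is _UNDEFINED:
            raise UndefinedValueError("read of undefined value")
        return self._value

    def write(self, value):
        self._value = value

def increment(actual, intent):
    """Model of `i = i + 1` for a dummy argument with the given intent."""
    # intent "out": the local copy starts undefined;
    # intent "inout": it starts with the caller's value.
    local = Cell() if intent == "out" else Cell(actual.read())
    local.write(local.read() + 1)
    actual.write(local.read())  # copy the result back to the caller

x = Cell(41)
increment(x, "inout")
print(x.read())  # 42
try:
    increment(Cell(41), "out")
except UndefinedValueError as err:
    print("violation:", err)
```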

In summary, our extended front end focuses on preserving defects and translates 
source code into a CIVL-C AST specifically designed for verification. Violations 
of variable attributes and function specifications are detected by performing 
specialized analysis or by inserting auxiliary information into the AST 
that is analyzed in a later phase. 


4 Fortran Array Modeling 


Fortran arrays are more powerful than arrays in most other languages, and require 
special handling during the translation to CIVL-C. Section 4.1 briefly 
discusses some of the features of Fortran arrays, and Section 4.2 describes how 
these features are modeled. 


4.1 Fortran Array Semantics 


Fortran natively supports multi-dimensional arrays. For example, b and c in 
Figure 5 are two-dimensional arrays. Fortran stores arrays in column major 
style, unlike C arrays, which are stored in row major style. 


     1  REAL :: b(6,3), c(0:9,-3:3), u(3)
     2  REAL, POINTER, DIMENSION(:) :: p
     3  INTEGER, DIMENSION(3) :: idx
     4  ! copy columns -1, 0, 1 from every other row of c into the first 5 rows of b
     5  b(1:5,:) = c(::2,-1:1)
     6  ! fill the array idx with constant values 1, 4, 17
     7  idx = (/1, 4, 17/)
     8  ! use the array idx as indices into a. This will copy a[1,4,17] into u[1,2,3]
     9  u = a(idx)
    10  ! associate the pointer p with column 1 in array c
    11  p => c(:,1)
    12  b = 42.0


Fig. 5. Examples of Fortran array usage: a 2-dimensional array of size 6 x 3, a 2- 
dimensional array with non-default index ranges, a pointer to a one-dimensional array, 
and two one-dimensional arrays of size 3, one for integers and one for reals, are declared. 
Following that, several data copy operations and pointer associations are performed. 

Arrays in Fortran are 1-based by default, just like in Matlab or Julia, but 
unlike in C and many other languages. However, Fortran allows the base to be 
specified for each array dimension. For example, c in Figure 5 represents a two-dimensional 
array whose row dimension of size 10 is 0-based and whose column 
dimension ranges from -3 to 3. Array sizes and index ranges can be either defined 
statically or calculated from parameters or function and subroutine arguments. 


Fortran programs can in most situations determine the size of arrays using 
the intrinsic size or shape functions. It is also possible to modify an entire array, 
or an array along an entire dimension, without explicitly referring to its size. For 
example, one can assign a scalar value to an entire array through a simple assignment 
as shown in line 12 of Figure 5. Fortran compilers usually implement this 
behavior using an array descriptor that is embedded in the generated program 
and contains the array size and shape information. 


Furthermore, Fortran supports the extraction of slices from an array by specifying 
a subscript triplet for each dimension, which specifies a lower and upper 
bound on the index as well as a stride. It is possible to omit the lower (and/or 
upper) bound, in which case the start (and/or end) of the array is used. An optional 
stride n can be specified to extract only every n-th element. For example, 
line 5 in Figure 5 extracts the even-numbered rows of c, and of those, only the 
columns from -1 to 1. These values are then copied into the first five rows of b. Instead 
of subscript triplets, one can also use an integer array as an index for another 
array. This is shown in line 9. Fortran provides other ways to modify or reinterpret 
arrays, including the reshape function that can change the number of 
dimensions and the size in each dimension, and has optional arguments to pad 
or reorder an array. 
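
As a concrete reading of the subscript triplet, the following Python sketch lists the indices a triplet selects. The helper triplet_indices is a hypothetical name introduced for illustration; it treats the upper bound as inclusive, as in Fortran, and expects the omitted bounds to have been replaced by the declared bounds of the array already.

```python
def triplet_indices(lower, upper, stride=1):
    """Indices selected by the subscript triplet lower:upper:stride (upper inclusive)."""
    if stride == 0:
        raise ValueError("stride must be nonzero")
    stop = upper + 1 if stride > 0 else upper - 1
    return list(range(lower, stop, stride))

# c is declared as c(0:9,-3:3). In c(::2,-1:1) the row triplet omits both
# bounds, so the declared bounds 0 and 9 are used with stride 2.
rows = triplet_indices(0, 9, 2)
cols = triplet_indices(-1, 1)
print(rows)  # [0, 2, 4, 6, 8]
print(cols)  # [-1, 0, 1]
```

The section therefore has five rows and three columns, which matches the shape of b(1:5,:) on the left-hand side of the assignment.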


When an array is passed to a function or subroutine as an argument, it may 
be accessed with a different index scheme inside that function or subroutine. For 
example, a three-dimensional array with index ranges [0 : 8][0 : 2][0 : 2] could be 
passed to a subroutine that internally declares this argument as an array with 
ranges [1 : 9][1 : 3][1 : 3] or [0 : 8][0 : 8] or any other number of dimensions 
or index ranges, as long as the array within the callee has at most as many 
overall entries as the array within the caller. This essentially provides a view 
of the original array, and because Fortran uses the call-by-reference paradigm, 
any changes to this re-interpreted array within the callee will also affect the 
original array in the caller. Depending on the situation, the Fortran compiler 
may implement this using an array descriptor and suitable index expressions, 
or by transparently copying data to and from an array that is used within the 
callee. 


A similar situation occurs when Fortran pointers are used. Although they share 
a name with C pointers, their behavior and features differ significantly. 
Fortran pointers can represent a view into a multi-dimensional array, and contain 
size and shape information. For example, a pointer can be associated with an 
array slice that represents column 1 across all rows in an array, as shown in line 
11 of Figure 5. In this case, writing to the first element in p will also modify the 
first row in c’s column 1. The size and shape functions can be used on p and 
will return the size and shape of the portion of c that p is associated with. The 
pointer itself can be accessed with a subscript triplet or index array, and the 
pointer can be passed to a subroutine or function that may reinterpret it with a 
different dimensionality or index range. 

There are a number of details regarding the use of arrays and pointers in Fortran 
that we do not discuss in this paper for brevity. We refer to [1] (particularly 
Sections 5.4, 5.6 and 12.6.4) for a more thorough discussion. 


4.2 Modeling Fortran Arrays for Verification 


Arrays in CIVL-C always have indices starting at 0 and do not support strides, 
sectioning, or reshaping. To handle the features described in the previous subsection, 
each Fortran array is modeled by a CIVL-C array that is augmented with a 
recursive data structure called FORTRAN_ARRAY_DESCRIPTOR. This allows CIVL 
to model the rich Fortran array semantics using only CIVL-C language features. 
As Figure 6 shows, the descriptor stores metadata for an array instance, and 
contains the kind, rank, index upper and lower bounds and strides, as well as a 
pointer. 

When a Fortran program creates a new array from scratch, CIVL will create a 
CIVL-C array whose length is the total number of elements in the Fortran array. 
This array is then augmented with an array descriptor whose kind is SOURCE and 
whose pointer holds the memory address of the CIVL-C array. The bounds and 
stride in the descriptor are set according to those set by the Fortran program. In 
essence, the descriptor provides a mapping from the Fortran array index (which 
may be strided or non-zero-based) into the CIVL-C array index (which is dense 
and zero-based). This mapping is used by the CIVL-C program whenever the 
Fortran program accesses the array. 
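
The index mapping for a freshly created array can be written down in a few lines of Python. This is a sketch under two assumptions that the text does not state explicitly: the backing array is laid out in column-major order, and the array has unit stride in every dimension. The function name flat_index is hypothetical.

```python
def flat_index(index, lbnd, rbnd):
    """Map a Fortran multi-index to a zero-based offset in column-major order."""
    offset = 0
    scale = 1
    for i, lo, hi in zip(index, lbnd, rbnd):
        if not lo <= i <= hi:
            raise IndexError(f"index {i} outside {lo}:{hi}")
        offset += (i - lo) * scale   # shift to zero-based, weight by extent so far
        scale *= hi - lo + 1         # extent of this dimension
    return offset

# c(0:9,-3:3) from Figure 5: 10 rows, 7 columns, 70 elements in total.
lbnd, rbnd = (0, -3), (9, 3)
print(flat_index((0, -3), lbnd, rbnd))  # 0, first element
print(flat_index((0, -2), lbnd, rbnd))  # 10, start of the second column
print(flat_index((9, 3), lbnd, rbnd))   # 69, last element
```

The bounds check in the loop is where an out-of-bounds access in the Fortran index space would be detected, before any offset into the dense array is computed.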

If a Fortran array instance is created by reshaping or sectioning an existing 
array, no new CIVL-C array is created. Semantically, the new array instance 
in Fortran provides a view into the existing array, which we model by creating 
a new array descriptor with appropriate bounds and stride whose kind is 


    typedef struct FORTRAN_ARRAY_MEMORY *farr_mem;
    typedef struct FORTRAN_ARRAY_DESCRIPTOR *farr_desc;
    typedef enum FORTRAN_ARRAY_DESCRIPTOR_KIND {
      SOURCE,  // A var. decl. w/ an array type or a dimension attr.
      SECTION, // An array section
      RESHAPE  // An array, whose indices are reshaped w/ no cloning
    } farr_kind;
    struct FORTRAN_ARRAY_DESCRIPTOR {
      farr_kind kind;    // The kind of a Fortran array descriptor
      unsigned int rank; // The rank or the number of dimensions.
      int *lbnd;         // A list of index left-bounds for each dim.
      int *rbnd;         // A list of index right-bounds for each dim.
      int *strd;         // A list of index stride for each dim.
      farr_mem memory;   // Being non-null iff kind is 'SOURCE'
      farr_desc parent;  // Being non-null iff kind is NOT 'SOURCE'
    };


Fig. 6. Implementation of the CIVL-C array descriptor. 
Fortran source:

PROGRAM ARRAYOP
  INTEGER :: A(0:8)
  CALL SUBR(A(1:7:2))
  ! A: {0,1,0,2,0,3,0,4,0}
END PROGRAM ARRAYOP

SUBROUTINE SUBR(B)
  INTEGER :: B(-1:0, 2:3)
  B(-1, 2) = 1
  B(-1, 3) = 2
  B(0, 2) = 3
  B(0, 3) = 4
END SUBROUTINE SUBR

CIVL-C translation:

int main() {
  fa_desc A = fa_create(sizeof(int), 1, {{0},{8},{1}});
  fa_desc __arg_A = fa_section(A, {{1},{7},{2}});
  subr(__arg_A);
  fa_destroy(__arg_A); // pop section descriptor
  fa_destroy(A);       // free array descriptor and data storage
}
void subr(fa_desc __B) {
  fa_desc B = fa_reshape(__B, 2, {{-1,2},{0,3},{1,1}});
  *(int*)fa_subscript(B, {-1,2}) = 1;
  ...
  *(int*)fa_subscript(B, {0,3}) = 4;
  fa_destroy(B);       // pop reshape descriptor
}


Fig. 7. Transformation of array section and reshape operations 


set to SECTION or RESHAPE, and whose pointer stores the location of the array 
descriptor for the existing array. This new descriptor now provides a mapping 
from indices of the new array instance into indices of the existing array instance. 
Such an array section or reshaped array can itself be reshaped or sectioned by 
the Fortran program, which will result in a stack of array descriptors. Whenever 
the Fortran program accesses an array at a given index, CIVL will recursively 
use the mappings provided by the descriptors until the index in the underlying 
CIVL-C array is resolved by a descriptor of kind SOURCE. Figure 7 shows how 
some basic Fortran array operations are translated to CIVL-C using the array 
descriptor and associated utility functions. 
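The recursive resolution through a stack of descriptors can be sketched as follows. This is a simplified model in plain C under stated assumptions, not the CIVL library itself: the type `view`, the limit `MAX_RANK`, and the functions `index_to_pos`, `pos_to_offset` and `resolve` are hypothetical names; element order is assumed to be column-major; and a SECTION is assumed to be indexed by the parent indices it selects.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 4 /* illustrative limit, an assumption of this sketch */

typedef enum { SOURCE, SECTION, RESHAPE } view_kind;

typedef struct view {
    view_kind kind;
    int rank;
    int lbnd[MAX_RANK], rbnd[MAX_RANK], strd[MAX_RANK];
    const struct view *parent; /* NULL iff kind == SOURCE */
} view;

static int extent(const view *v, int k) {
    return (v->rbnd[k] - v->lbnd[k]) / v->strd[k] + 1;
}

/* Column-major position of idx among the elements of view v. */
static int index_to_pos(const view *v, const int *idx) {
    int pos = 0, mult = 1;
    for (int k = 0; k < v->rank; k++) {
        pos += (idx[k] - v->lbnd[k]) / v->strd[k] * mult;
        mult *= extent(v, k);
    }
    return pos;
}

/* Follow the descriptor stack until a SOURCE yields the dense offset. */
static int pos_to_offset(const view *v, int pos) {
    if (v->kind == SOURCE) return pos;
    assert(v->parent != NULL);
    if (v->kind == RESHAPE) return pos_to_offset(v->parent, pos);
    /* SECTION: recover the selected parent index in each dimension. */
    int idx[MAX_RANK];
    for (int k = 0; k < v->rank; k++) {
        int n = extent(v, k);
        idx[k] = v->lbnd[k] + (pos % n) * v->strd[k];
        pos /= n;
    }
    return pos_to_offset(v->parent, index_to_pos(v->parent, idx));
}

/* Dense zero-based offset of element idx of view v in the source array. */
int resolve(const view *v, const int *idx) {
    assert(v != NULL && v->rank <= MAX_RANK);
    return pos_to_offset(v, index_to_pos(v, idx));
}
```

With a SOURCE of bounds 0:8, a SECTION 1:7:2 on top of it, and a RESHAPE to (-1:0, 2:3) on top of the section, as in Figure 7, the elements B(-1,2), B(0,2), B(-1,3) and B(0,3) resolve to offsets 1, 3, 5 and 7 under these assumptions.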


5 Evaluation 


The first goal of this evaluation is to determine whether CIVL correctly verifies 
or finds defects in a suite of synthetic Fortran programs that use various language 
features peculiar to Fortran. The second goal is to investigate how CIVL performs 
on Fortran code from an existing production-level HPC application. 


5.1 Compute Environment and Experimental Artifacts 


All CIVL executions were conducted on a TACAS 2022 Artifact Evaluation 
Virtual Machine (AEVM) with Ubuntu 20.04; the version of CIVL is 1.21. All 
SMACK executions were conducted on a TACAS 2020 AEVM provided by the 
authors of [13]; the version of SMACK is 1.9.1. Both virtual machines were 
deployed by Oracle VirtualBox 6.1 on a laptop running MacOS 11.6.2 on a 
2.5 GHz Quad-Core Intel Core i7 CPU with x86_64 architecture and 16 GB 
memory. The CIVL program and all experimental artifacts can be downloaded 
from https://vsl.cis.udel.edu/tacas2022. 


5.2 Specification and Verification Approach 


As shown in Figure 8, CIVL primitives are inserted into the Fortran code to be 
verified as structured comments, which have no effect on the normal build process. 
Similar directives exist for C. These primitives have two major kinds: type qualifiers 
PROGRAM civl_primitive_example
!$CVL $input
  INTEGER :: arg
  INTEGER :: x
!$CVL $assume(-1 .LE. arg .AND. arg .LE. 1);
  x = arg
!$CVL $assume(x .LT. 0);
  x = ABS(x)
!$CVL $assert(0 .LE. x .AND. x .LE. 1);
END PROGRAM civl_primitive_example

Fig. 8. Example illustrating CIVL Fortran primitives. 


and verification statements. $input specifies that the variable in the following 
declaration is to be initialized with an unconstrained value of its type. The value 
can be subsequently constrained with an assumption statement. Alternatively, 
an input variable may be given an exact concrete value on the command line. 
Input variables are read-only. 

The $output qualifier declares a variable to be write-only. Output variables 
are used for functional equivalence verification. When two programs have the 
same input and output variables, they can be compared to determine whether, 
given the same inputs, the two programs will produce the same outputs. This 
is carried out by CIVL’s compare command, which merges the two programs 
into a single program with a new driver. The driver invokes the two programs in 
sequence on the same input variables, and then asserts that the corresponding 
outputs agree. 

A CIVL assumption statement has the form $assume(expr);. It is used to 
constrain the set of executions that are considered to be valid. If an assump- 
tion is violated, no error is reported; instead, the execution is ignored and the 
search backtracks immediately. $assert(expr); reports an assertion violation 
if the argument expression does not hold. This statement provides the capabil- 
ity of checking desired properties in Fortran, which has no intrinsic assertion 
procedure. All primitives must be preceded by the prefix !$CVL. 
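The difference between the two statements can be illustrated with a small sketch in plain C that mimics the program of Figure 8. CIVL explores inputs symbolically; here, as a stand-in assumption, a finite range of concrete inputs is enumerated instead. The names `outcome`, `run_once` and `verify_range` are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef enum { PATH_IGNORED, PATH_OK, PATH_VIOLATION } outcome;

/* One execution of the Fig. 8 program for a concrete input value. */
outcome run_once(int arg) {
    if (!(-1 <= arg && arg <= 1)) return PATH_IGNORED; /* $assume fails: no error */
    int x = arg;
    if (!(x < 0)) return PATH_IGNORED;                 /* $assume fails: no error */
    x = abs(x);
    if (!(0 <= x && x <= 1)) return PATH_VIOLATION;    /* $assert fails: reported */
    return PATH_OK;
}

/* Checks every input in [lo, hi]. Returns true if no assertion is violated
 * and stores the number of executions that satisfied all assumptions. */
bool verify_range(int lo, int hi, int *valid_paths) {
    *valid_paths = 0;
    for (int arg = lo; arg <= hi; arg++) {
        outcome o = run_once(arg);
        if (o == PATH_VIOLATION) return false;
        if (o == PATH_OK) (*valid_paths)++;
    }
    return true;
}
```

Over any range containing -1, only the input -1 survives both assumptions, and the assertion holds for it, so no violation is reported.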


5.3 Fortran Verification Benchmark Suites 


Our suite incorporates the 22 synthetic examples from the SMACK suite [13]. 
These examples cover basic Fortran structures ranging from expressions to func- 
tions and subroutines. The only change made is to switch SMACK-style asser- 
tions and symbolic value assignment to CIVL primitives. SMACK uses calls of 
the form assert(expr) to check desired properties, which is similar to CIVL’s 
$assert primitive. With SMACK, symbolic values are generated by calling 
__verifier_nondet_int() and assigning the result to a variable, while CIVL 
uses the $input qualifier. 

To these, we added 13 examples we created ourselves, exercising different 
language features, including argument intent specification, array sectioning, and 
boolean expressions that might lead to different results if short-circuiting is or is 
not used. We include a parallel example that uses an OpenMP for loop, executed 
Fig. 9. Total verification time (in seconds) for CIVL and SMACK on benchmarks. 
Each time is the mean over 5 of 7 executions after dropping the shortest and longest. 


with 4 threads. Finally, we constructed 4 pairs of programs each of which can 
be compared for functional equivalence. 

The programs are listed on the x-axis in Figure 9. Where the name includes 
“fail” or “bad”, a negative verification result is expected; otherwise, a positive 
result is expected. The figure also shows the average verification execution time 
printed by CIVL and SMACK on each example. CIVL has correct results in all 
cases, while SMACK encounters exceptions or has incorrect results for some of 
the CIVL Fortran examples. Thus, the figure only reports timing results when 
the verification results are correct. 
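The averaging protocol stated in the caption of Figure 9 (seven runs, with the shortest and longest dropped) amounts to a trimmed mean. A minimal sketch in C follows; the function name `trimmed_mean` is hypothetical and the code is not taken from the experimental artifacts.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n timings after dropping one minimum and one maximum.
 * Requires n >= 3. */
double trimmed_mean(const double *t, size_t n) {
    assert(n >= 3);
    double sum = t[0], lo = t[0], hi = t[0];
    for (size_t i = 1; i < n; i++) {
        sum += t[i];
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return (sum - lo - hi) / (double)(n - 2);
}
```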


5.4 Verifying Nek5000 Components 


Nek5000 [21] is a computational fluid dynamics code for simulating unsteady 
incompressible two- or three-dimensional fluid flow. Nek5000 has hundreds of 
industrial and academic users and won a Gordon Bell prize for its scalability on 
high performance compute clusters. 

The code contains many Fortran subroutines that perform a numerical com- 
putation that can be easily expressed in a formal way. For example, there are 
various implementations for matrix multiplication, each optimized for best per- 
formance on a particular matrix size. We use CIVL to verify that these subrou- 
tines indeed compute matrix multiplications, by showing their equivalence with 
a straightforward un-optimized textbook implementation. 
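The idea of the equivalence check can be sketched in plain C as a driver that runs a reference and an optimized kernel on the same inputs and compares the outputs. This is only an illustration under several assumptions: the kernels below are not Nek5000 code, integer arithmetic is used so that equality is exact, and the inputs are concrete, whereas CIVL treats them symbolically. All function names are hypothetical.

```c
#include <stdbool.h>

enum { N = 3 }; /* matrix dimension, fixed for this sketch */

/* Textbook triple loop: c = a * b for row-major N x N matrices. */
void mxm_ref(const int *a, const int *b, int *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

/* Variant with the inner loop unrolled by hand for N = 3. */
void mxm_unrolled(const int *a, const int *b, int *c) {
    for (int i = 0; i < N; i++) {
        const int *row = a + i * N;
        for (int j = 0; j < N; j++)
            c[i * N + j] = row[0] * b[j] + row[1] * b[N + j]
                         + row[2] * b[2 * N + j];
    }
}

/* Driver: run both kernels on the same inputs and compare the outputs. */
bool outputs_agree(const int *a, const int *b) {
    int c_ref[N * N], c_opt[N * N];
    mxm_ref(a, b, c_ref);
    mxm_unrolled(a, b, c_opt);
    for (int i = 0; i < N * N; i++)
        if (c_ref[i] != c_opt[i]) return false;
    return true;
}
```

A test on concrete inputs only samples the input space; the point of the symbolic comparison described above is that agreement is established for all inputs of the given size.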

Furthermore, Nek5000 contains subroutines to numerically approximate the 
integral of a function, a process known as quadrature. Quadrature rules typi- 
cally define carefully chosen locations, known as quadrature points, at which the 
function in question is evaluated. The results are then each multiplied with a 
weight, and summed to obtain the overall integral. The quality of a quadrature 
rule is often evaluated by quantifying its order of accuracy, where a higher order 
Correct example:

N Points 2 Degree 2
Ref soln 2*AF_SIN(2)
Quadrature 2*AF_SIN(2)
Expected error: ZERO

=== Source files ===
util.f (util.f)
driver_speclib.f (driver_speclib.f)
speclib.f (speclib.f)

=== Command ===
civl verify -checkMemoryLeak=false util.f driver_speclib.f speclib.f

=== Stats ===
   time (s)            : 11.64
   memory (bytes)      : 3393191936
   max process count   : ...
   states              : 54336
   states saved        : 50392
   state matches       : 0
   transitions         : 54335
   trace steps         : 35239
   valid calls         : 148085
   provers             : cvc4, z3, why3
   prover calls        : 10

=== Result ===
The standard properties hold for all executions.

Erroneous example:

Violation 0 encountered at depth 3244:
CIVL execution violation in p0 (kind: ASSERTION_VIOLATION, certainty: MAYBE)
at driver_speclib_bad.f:103.6-12
.. Program Output Message ..
!$CVL $ASSERT(DIFF .EQ. MINDIFF)
...
.. Detailed Violation Info ..
...

=== Source files ===
...

=== Command ===
...

=== Stats ===
   time (s)            : 4.19
   memory (bytes)      : 2587885568
   max process count   : ...
   states              : 4973
   states saved        : 4585
   state matches       : 0
   transitions         : 4974
   trace steps         : 3244
   valid calls         : 13662
   provers             : cvc4, z3, why3
   prover calls        : ...

=== Result ===
The program MAY NOT be correct. See CIVLREP/util_log.txt


Fig. 10. CIVL output for verifying correct and erroneous Nek5000 examples 


quadrature rule yields the exact result for polynomials of a higher degree. The 
Gauss-Lobatto Legendre quadrature rules are a unique set of weights and points 
that are known to be optimal under certain conditions, and are used in Nek5000. 
We use CIVL to verify that the quadrature implemented in Nek5000 indeed has 
the claimed order of accuracy, by verifying that the quadrature is exact for poly- 
nomials with symbolic coefficients of the claimed degree. Due to its uniqueness 
properties, this also proves that Nek5000 indeed uses Gauss-Lobatto Legendre 
weights and points. 
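To make the exactness property concrete, the following sketch in plain C checks the 3-point Gauss-Lobatto Legendre rule on [-1, 1] (points -1, 0, 1 with weights 1/3, 4/3, 1/3) against monomials. By linearity, exactness on all monomials up to a given degree implies exactness for every polynomial of that degree. The code is an illustration only: it uses floating-point arithmetic with a tolerance rather than symbolic coefficients, and all names are hypothetical rather than taken from Nek5000.

```c
#include <stdbool.h>

/* 3-point Gauss-Lobatto Legendre rule on [-1, 1]. */
static const double GLL3_X[3] = { -1.0, 0.0, 1.0 };
static const double GLL3_W[3] = { 1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0 };

/* x raised to a non-negative integer power. */
static double ipow(double x, int k) {
    double r = 1.0;
    for (int i = 0; i < k; i++) r *= x;
    return r;
}

/* Exact integral of x^k over [-1, 1]. */
double monomial_integral(int k) {
    return (k % 2 != 0) ? 0.0 : 2.0 / (double)(k + 1);
}

/* Quadrature approximation of the integral of x^k over [-1, 1]. */
double gll3_monomial(int k) {
    double s = 0.0;
    for (int i = 0; i < 3; i++) s += GLL3_W[i] * ipow(GLL3_X[i], k);
    return s;
}

/* True if the rule integrates x^0 .. x^degree exactly (up to rounding). */
bool gll3_exact_up_to(int degree) {
    for (int k = 0; k <= degree; k++) {
        double err = gll3_monomial(k) - monomial_integral(k);
        if (err < 0.0) err = -err;
        if (err > 1e-12) return false;
    }
    return true;
}
```

The rule is exact up to degree 3 and fails at degree 4, where it yields 2/3 instead of 2/5, consistent with the general order 2n-3 of an n-point Gauss-Lobatto rule.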

We also seeded some of these implementations with defects and confirmed 
that CIVL reports the defects. Figure 10 shows the output from CIVL on a 
correct and an incorrect example from Nek5000. Table 1 shows the verification 
results for the Nek5000 excerpts for various parameter values. The expected 
result is obtained in all cases, at modest cost (about 12 seconds at most). 


6 Related Work 


Fortran has been the focus of early program verification research. One of the 
first papers on symbolic execution dealt with Fortran [8], and one of the earliest 
verification condition generation tools was for Fortran [6]. More recently, several 


Fortran static analyzers have been developed, including ftnchek [19], Cleanscape 
FORTRAN-Lint [9], and FORCHECK/Coverity [35]. These tools detect certain 
Name            LoC  Result  Scale                     Time    States
speclib         560  True    2 ≤ NP ≤ 2; 2 ≤ DEG ≤ 3   5.14s   10857
speclib         560  True    2 ≤ NP ≤ 3; 2 ≤ DEG ≤ 5   12.08s  55908
speclib_bad     560  False   2 ≤ NP ≤ 2; 2 ≤ DEG ≤ 3   4.67s   6011
speclib_bad     560  False   2 ≤ NP ≤ 3; 2 ≤ DEG ≤ 5   4.27s   3223
mxm_unroll      458  Eqv     3x3                       5.49s   26867
mxm_unroll      458  Eqv     4x4                       8.51s   59914
mxm_unroll_bad  458  NEq     3x3                       5.48s   26865
mxm_unroll_bad  458  NEq     4x4                       8.56s   59912
mxm_pencil      458  Eqv     2x2                       5.83s   9264
mxm_pencil      458  Eqv     3x3                       7.38s   26893
mxm_pencil      458  Eqv     4x4                       10.14s  59968
mxm_pencil_bad  458  NEq     2x2                       6.01s   9262
mxm_pencil_bad  458  NEq     3x3                       7.53s   26891
mxm_pencil_bad  458  NEq     4x4                       10.48s  59966


Table 1. Results of verifying Nek5000 code excerpts at various scales 


pre-defined generic defects, such as variables that are read but never written, un- 
used variables and functions, and inconsistencies in common block declarations. 
They do not allow one to specify and verify functional correctness properties. 

Other tools use dynamic analysis (or a combination of static and dynamic 
analysis) to check such generic properties. One example uses the PIPS compiler 
to detect forbidden aliasing in subroutines [22]. The NAG Fortran compiler can 
also insert checking code to catch many defects at runtime [29]. 

In contrast, CamFort [27] implements a lightweight specification and static 
analysis approach. The user annotates the Fortran program with comments in a 
domain-specific language for specifying array access patterns (stencils) or 
associating units of measure with variables. CamFort, which is written in Haskell, 
parses the code, constructs an AST, and verifies conformance to the properties 
using Z3. This approach strikes a balance between the generality of program ver- 
ifiers such as CIVL, which can specify arbitrary assertions in a general purpose 
assertion language, and the more tractable static analysis tools. 

Several tools have been developed to translate Fortran to other languages. 
These include f2c [11] (which translates to C) and Fable [14] (C++). In addition 
to the issues discussed in Section 3.1, the potential of these tools as front ends for 
verifiers is limited by the fact that the translated code is often considerably more 
complex than the original or involves complex libraries which the verifier must 
also understand. It should be noted that Fable’s approach to modeling Fortran 
arrays is similar to ours in that it defines a class that bundles a reference to the 
data with meta-data describing the “view” of the array. 

A number of verification tools work off of the LLVM compiler’s low-level in- 
termediate language, LLVM IR. These include SMACK [30], Divine [2], LLBMC 
[34], and SeaHorn [15]. In theory, this should allow one to chain together any 
of the many compiler front ends that generate LLVM IR with a general LLVM 
IR verifier. In practice, this is very difficult, and most of these verifiers accept 
only a subset of LLVM IR generated by a particular front end from a partic- 
ular source language—usually C or C++ [13]. To the best of our knowledge, 
only SMACK has been applied to Fortran [13], using the Flang front end [12]. 
However, the subset of Fortran accepted and the example codes themselves are 
small. A more significant concern, discussed in Section 3, is that a front end 
may “compile away” defects in the source program by choosing one of several 
acceptable ways to translate a construct with unspecified behavior, or assuming 
the absence of undefined behaviors. 

In this work we have translated Fortran to the intermediate verification lan- 
guage (IVL) CIVL-C. Other, more widely-used, IVLs include Boogie [3] and 
Why3 [5]. Among these languages, CIVL-C stands out for its robust support for 
pointers and concurrency, which simplifies much of the modeling effort. 

The CIVL verifier analyzes a CIVL-C program using symbolic execution, a 
widely-used technique for test-case generation and verification. Other mature 
symbolic execution tools include KLEE [7] (for C programs, via LLVM) and 
Symbolic PathFinder [23] (for Java byte code). 


7 Conclusion and Future Work 


We presented a Fortran extension to CIVL, a novel model-checking approach 
that preserves and reveals defects in source code written in Fortran. Compared 
with compiler-based verifiers, this tool parses and analyzes source programs from 
a verification perspective. In doing so, it mitigates the risk of missing 
defects that are eliminated via legal but non-defect-preserving compiler 
optimizations. 

The extension includes a data structure and associated algorithms for de- 
scribing Fortran array metadata and tracking complex array transformations. 
This method of handling Fortran arrays could be adopted by other verification 
tools. The extension also supports a set of CIVL verification primitives which 
can be introduced into Fortran programs as structured comments. 

Evaluation results show that our tool performs correctly and quickly (com- 
pared to previous work) on a range of synthetic benchmarks and some kernels 
extracted from real world applications. In the future, we plan to enlarge the 
supported subset of Fortran language features and to enhance support for veri- 
fying Fortran programs with OpenMP directives. The resulting CIVL extension 
is expected to cover the DataRaceBench [36] suite, including both the C and 
Fortran examples. 
Abstract. We present NORMA, a tool for the modeling and analysis of 
Relay-based Railway Interlocking Systems (RRIS). NORMA is the result 
of a research project funded by the Italian Railway Network, to support 
the reverse engineering and migration to computer-based technology of 
legacy RRIS. The frontend fully supports the graphical modeling of Italian 
RRIS, with a palette of over two hundred basic components, stubs 
to abstract RRIS subcircuits, and requirements in terms of formal properties. 
The internal component-based representation is translated into 
highly optimized Timed NUXMV models, and supports various syntactic 
and semantic checks based on formal verification, simulation and test 
case generation. NORMA is experimentally evaluated, demonstrating the 
practical support for the modelers, and the effectiveness of the underlying 
optimizations. 


Keywords: Relay-based Railway Interlocking Systems · graphical modeling · 
model checking 


1 Introduction 


Railway interlocking systems (RIS) are complex signaling apparatus that prevent 
conflicting movements of trains through an arrangement of tracks, most notably 
stations. The basic requirement is that a signal to proceed is not displayed unless 
the route to be used is proven safe. This means positioning the switches in the 
appropriate position, controlling the level crossings, and setting the aspects of 
the signals to indicate the expected speed restrictions. 

Although the world is slowly migrating to computer-based RIS, the predomi- 
nant solutions are still based on electromechanical technology, where the logic of 
the interlocking procedures is encoded in the evolution of the status of the circuit 
relays. RRIS are a costly and hard to modify technology. Yet, RRIS have been 
working correctly and safely for decades. In the migration from relay-based to 
computer-based RIS, it would be a natural choice to use RRIS as golden require- 
ments for the new implementations. However, they are de facto legacy systems, 
whose behavior is known only to a handful of highly specialized domain experts, 
and which are hence expensive to maintain and update. Moreover, the RRIS schematics are often 
available only in printed form, and their behaviour needs to be manually sim- 
ulated. Understanding RRIS, modeling them in digital format and extracting 
requirements from them is hence a major challenge in the migration process. 

In this paper we present NORMA, a real-world tool for the formal modeling 
and analysis of RRIS. NORMA is the result of a research project funded by 
the Italian railway network company (Rete Ferrovie Italiane - RFI), within a 
process of reverse engineering and migration to computer-based technology of 
the legacy RRIS currently in operation [6]. NORMA leverages formal verification 
techniques to provide extensive support for modeling, debugging, understanding, 
traceability and verification. This also enables simulation, testing, and properties 
extraction from RRIS. 

The NORMA frontend fully supports the graphical modeling of the Italian 
RRIS, with a palette of over two hundred types of configurable components, 
for both direct and alternating current, and for both the single- and double-wired 
conventions; it allows the use of stubs to abstract RRIS subcircuits, and the 
specification of requirements in terms of formal properties. 

Given that RRIS often contain thousands of component instances, the task of 
manually modeling RRIS is repetitive and error prone. To assist the modeler, 
NORMA supports a modeling style where components are accurate at the electrical 
level, so that the digital model is in a one-to-one correspondence with the 
printed schematic, and no manual abstraction is needed. Furthermore, a number 
of syntactic and semantic checks ease the debugging of the models. In fact, while 
RRIS are operating correctly, errors may be introduced in the modeling process. 

RRIS graphical models are internally represented with suitable data structures, 
and automatically formalized in NUXMV as symbolic timed transition 
systems over Boolean and real-valued variables. Then, a rewriting pipeline im- 
plements several domain-specific simplification steps that take into account the 
features of the RRIS to produce a drastically reduced model. Notable simpli- 
fications in the pipeline include the identification and inlining of functionally 
dependent variables, and contextual determinization of unconstrained signals. 

The simplified model is then amenable to simulation, checking of invariant 
and temporal properties, and test case generation, in addition to providing a 
number of semantic checks deriving from built-in properties. 

NORMA is based on an extensible software architecture. It is built on the DIA 
toolset, and follows a library-based approach, where each component is modeled 
and tested in isolation. In turn, the component library is built by means of an 
automated process relying on configuration tables. At the core, the verification 
process is carried out by the NUXMV model checker. 

NORMA is actively being used within RFI. We experimentally evaluated its 
capabilities on real-world RRIS schematics with thousands of variables, with sev- 
eral important findings. First, it is very effective in supporting the modelers: the 
semantic checks proved to be invaluable to pinpoint several subtle modeling er- 
rors. Second, the underlying optimizations dramatically reduce the computation 
time of the verification tasks. Third, NORMA supports the automated extraction 
of specifications, such as the table of mutually incompatible routes encoded in 
the RRIS of a medium-sized station. 

To the best of our knowledge, NORMA is the only tool supporting the mod- 
eling and the formal analysis of real-world RRIS. It integrates behind a graph- 
ical front-end some powerful reasoning capabilities, without exposing domain 
experts to the intricacies of formal verification. Approaches to the formal anal- 
ysis of RRIS have been proposed [5,12,11,15]. However, the RRIS is modeled 
at a high level of abstraction, so that important features are lost. More impor- 
tantly, the user is in charge of ensuring the correspondence between the circuit 
schematics and the formal model. Given the typical size of real-world RRIS, the 
process appears to be very prone to errors. In contrast, we rely on a comprehen- 
sive approach to component-based modeling [8], where the RRIS is described 
as a multi-domain switched Kirchhoff network, hence supporting a precise and 
electrically-accurate semantics. 

This paper is structured as follows. In Section 2 we present the background 
domain of RRIS. In Section 3 we overview the functions of NORMA. Then, in 
Sections 4, 5 and 6 we discuss in detail the front end, the compiler and the 
simplifier. In Section 7 we overview the software architecture of NORMA, and 
in Section 8 we present the experimental results. In Section 9 we draw some 
conclusions and discuss the future developments. 


2 Relay-based Railway Interlocking Systems 


At the beginning of the 20th century, the rapid growth in the development of railway 
systems called for technological solutions to avoid collisions among trains 
and other safety critical issues. Signals were originally installed at fixed track 
side positions, featuring mechanical arms which were manually operated through 
levers, pulleys and wires from local signal boxes. As purely mechanical devices 
proved soon to be very unreliable, they were substituted by electric and elec- 
tromechanical devices, like for example signals with colored lights and railroad 
motor switches. Aside from being much more reliable and economically sustain- 
able, these new devices could now be controlled remotely in a centralised fashion. 
The control procedures went from manually operating each device individually, 
to logics able to operate automatically multiple devices at once, for example 
to safely create and monitor an itinerary for a train to leave a station. These 
centralized logics were mainly based on relays and proved to be able to operate 
reliably for decades. 

A relay is an electromechanical element, generally composed of a coil and one 
or more contacts: when the coil is traversed by sufficient current, it generates a 
magnetic field that will close or open the contacts depending on the relay type. 
When the current flow is interrupted, the contacts will return to their initial 
state. 

By combining relays in circuits it is possible to implement a sequential logic, 
where the combinational part is made of series/parallel circuits of relay contacts, 
Fig. 1. Extract of schematics of itinerary from Italian legacy RIS relay logic 


and the memory part is encoded in the relay coil state, i.e. being powered or not. 
Inputs from the environment of such logic are electrical signals coming from the 
rail track (e.g. train pedals), or from the user-interface (e.g. buttons or levers). 
Outputs to the environment are electrical commands to the plant (e.g. power to 
a rail crossing motor) or to the user-interface (e.g. light bulbs that represent a 
signal status). 
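As a generic illustration of such sequential logic, the sketch below models a textbook self-holding (seal-in) relay circuit in C; it is not taken from any RRIS schematic. A start button in parallel with the relay's own normally-open contact, in series with a normally-closed stop button, feeds the coil. The names `relay`, `relay_step` and `lamp_on` are hypothetical, and all timing effects are ignored.

```c
#include <stdbool.h>

/* Memory part: whether the coil is currently powered. */
typedef struct { bool coil; } relay;

/* One evaluation step of the circuit for the given input signals.
 * Combinational part: (start OR own contact) in series with NOT stop. */
void relay_step(relay *r, bool start, bool stop) {
    bool contact = r->coil;                /* normally-open contact follows the coil */
    r->coil = (start || contact) && !stop; /* parallel branch in series with stop */
}

/* Output to the environment: a lamp wired through a contact of the relay. */
bool lamp_on(const relay *r) { return r->coil; }
```

Pressing start once energizes the coil, which then holds itself through its own contact until stop is pressed; the coil state is the only memory in the circuit.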

The general concept of relay circuit logic is specialized in the solutions 
adopted in the Italian railway network. In such a domain, circuits are represented 
as schematics in separate sheets, along with informative material like topological 
schematics of the track and devices controlled by the relay logic, tables, tex- 
tual notes, etc. Circuits are made of interconnected components. Components 
have terminals, and terminals are connected by lines representing electrical con- 
nections. Some components have an associated name to represent the relation 
between a coil and its contacts. 

The domain has several interesting characteristics. The first one is its 
complexity: there is a large number of component types that differ 
in the timing required to operate, the number of memory elements that can 
be stored, and so on. In particular, base components like coils, contacts, levers, 
loads, etc. can be combined with zero, one or more specifiers to specialize the 
component's behaviour. This combination leads to more than 5000 component 
types which can be instantiated in a circuit. As an example, there exist dozens 
of types of relay, which are characterized by being delayed or not (when activating, 
deactivating, or both), polarized or not, single or double coil, stabilized or not, 
etc. 

Second, the circuits can operate with either direct or alternating current, 
where discrete signals (e.g. the maximum allowed train speed in a track segment) 
are encoded by means of frequency and/or amplitude modulation. Some compo- 
nent types can generate modulated current, and some corresponding component 
types can read it and react accordingly. 

Third, several design conventions were adopted for the sake of readability of 
the relay logic circuits. Three significant cases are: (1) the logical representation 
of components in circuits, (2) single/double-wired circuits and (3) units. Circuits 
are logically represented (1) in schematics, in contrast with the physical 
representation found in conventional electrical schematics. In such logical representation, 
relay coils and their contacts are represented in separate circuits, dislocated ac- 
cording to logical criteria, as the coil of a single relay and its contacts may indeed 
belong to different logical functions. Like in computer programs, where messages 
can be sent and received from different logical units, coils can be activated (mes- 
sage sent) and contacts react accordingly (message received) in separate logical 
units. Separation of coils and circuits in the schematics helps preserve logical 
cleanliness and may improve the readability of the schematics. The relation 
between a coil and its contacts is pragmatically kept by using the same name. 
See for example Fig. 1, where the coil of relay “F” (bottom left) and some of its 
contacts appear in the same circuit. 

In single-wired circuits (2), an electrical connection line between two com- 
ponents implicitly represents a pair of wires, i.e. the current flows in one direc- 
tion through one wire in the implicit pair, and returns through the second wire 
in the pair. The single-wire representation is frequently used as it is practical 
and readable. However, sometimes it is needed to represent explicitly the two 
wires, e.g. when a contact needs to cut explicitly the return wire. Sometimes the 
single/double-wired representation is mixed in the same schematics or even in 
the same circuit. 

Units (3) are a pragmatic way of reusing parts of a schematic. A unit represents 
a generic functionality, e.g. the logic which can control a single rail switch. 
A unit is a set of circuits associated with a name. Other circuits can refer to 
a set of components which a unit contains (thereby instantiating the unit), by 
using the same component names along with the unit’s name as namespace. For 
example, in Fig. 1, all contacts “H” belong to the unit “UGB92”. 

As a final remark, consider that the schematics of relay circuits are legacy, 
available in terms of large printouts. They have been designed over many 
decades, with new features added incrementally and often monotonically. This 
makes the logic of a medium-sized station interlocking very large, containing 
thousands of components spread over dozens of A0 sheets. 


3 NORMA: overview 


NORMA is a tool to model, trace, understand and analyze RRIS used in the 
Italian railway network. The ultimate goal of NORMA is to support the under- 
standing and the reverse engineering of RRIS. This demands that all original 
schematics in printouts get correctly digitalized into formal models that are 
amenable for automated verification, e.g. by a model checker. 

In the workflow supported by NORMA (Fig. 2), there are three main working 
lanes: Modeling, Traceability and Analysis. 

Modeling supports the graphical digitalization of formal models. Traceability 
keeps the links between the created models and the corresponding requirements 
and regulatory documents. Analysis supports the verification of properties about 
Fig. 2. High-level workflow with NORMA 


the RIS. These lanes are described further in Section 4 (Graphical Modeling) 
and Sections 5 and 6 (Analysis). 

The activities involved in these lanes are performed by different users oper- 
ating as a structured team: the administrator inserts input artifacts to enable 
modelers and analyzers to work on them. Each modeler works on schematics 
areas exclusively, and their contributions are merged by the administrator into 
the project. This avoids any risk of conflicts. 


Modeling and Traceability. Modeling is enabled by an administrator that creates 
a project, adds images of RRIS schematics to be modeled, adds regulatory docu- 
ments for traceability, and commits the modifications to a centralized repository 
which all enabled users can access. Modelers can then check out the project 
locally and graphically model components (picked from a palette), connections, 
units and all other parts that are relevant for the formal analysis. As the size 
and number of schematics is considerable, NORMA's architecture allows for 
the integration of an image classifier/recognizer to (semi)automatically recognize 
components and connections among them. This feature is currently under eval- 
uation and not yet deployed. The modeler can also select regions of the modeled 
RRIS and associate those regions with parts in regulatory documents describ- 
ing the requirements that the selected RRIS covers. The modeler can check the 
model against a set of syntactic rules (e.g. there are no floating connections) and 
iterate, then they can commit the local modifications to the remote repository 
for the administrator evaluation and admittance of the contribution. 


Analysis. The formal core of NORMA is based on a compiler that transforms a
graphical model into a formal model in the SMV language, which can be processed
by the NUXMV model checker. The compiler picks components from an SMV library
of timed automata with real-valued variables, each corresponding to a component
type in the graphical model, and composes the networks according to the
electrical and logical connections among them. Stubs are SMV modules that the
analyzer uses to model abstracted parts of the RRIS which do not appear
in the graphical model. The compiler injects and connects the stubs directly in
the generated SMV model. The result of the compilation is a model whose size 
is directly related to the size of the modeled RRIS. In order to ease the formal 
verification process, the SMV model passes through a conservative simplification 
process that can dramatically reduce its size. Model checking is finally carried out 
by expressing LTL or invariant properties and using the NUXMV model checker. 


4 Graphical modeling of RRIS 


A NORMA project is defined as a set of Documents and RRIS, with
meta-information to link them. The key idea is to allow modelers to draw the
digital schematic on top of the original RRIS image. Hence, the RRIS within a
project are structured in layers, where content in higher layers hides the
content in lower ones. Layering enables a clear separation between different
types of elements, and supports the modelers during the digitalization process.

There are 5 layers: Original RRIS, holding the image of the original RRIS 
schematics; Masks, used to hide sections of Original as soon as they get modeled; 
Modeled Components, the main working layer where modelers put components 
and connections; Units layer, holding named polygons modeling Units; Trace- 
ability, holding several types of tracing information. By hiding/showing a layer, 
the modeler is able to focus on specific parts of the RRIS, e.g. to identify those 
parts that still need to be modeled. 


Modeling mainly consists of placing components, connections and units in the
diagram. The components palette (Fig. 3) has a central role in this process.
Because the number of components is so large, a customized interactive palette
was designed to replace the default, flat palette, which could not be used
effectively. The modeler is guided through the process of picking and
configuring the required component and specifiers matching the RRIS section
being modeled. The selection begins from the typology (coils, contacts, load,
etc.) and continues in interactive steps to allow the characterization of
specifiers, single/double wiring, number of terminals, parameters, flipping and
rotations, etc. The effectiveness of the process is increased by means of
domain-specific constraints that restrict the user's choices. These constraints
are automatically generated, as described in Section 7. Connections and
junctions are also customized with respect to the default style.


Traceability aims at linking requirements, found in regulatory documents, to the 
fragments of RRIS implementing them. NORMA allows the modelers to select 
texts and images in PDF documents, and to select regions of the model. Each
selection is automatically given an ID, and IDs can be linked at project level
to keep traceability information.


Utilities are made available to help the modeler. It is possible to search
through traceability data and models, for example to search for components by
name or by type, or to search for the coil corresponding to a given contact.
Syntactic checkers can be run to spot errors or other issues such as missing or
wrong parameters, missing components (e.g. a contact without a corresponding
coil), missing connections, etc. Selecting an issue moves the focus to the
specific location in the model.
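
As an illustration of the kind of syntactic check mentioned above, the
following minimal Python sketch reports contacts whose relay has no coil in the
model. The `Component` record, its `relay` field and the function name are
hypothetical and are not taken from NORMA; the sketch only shows the shape of
such a check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str        # identifier shown in the schematic (hypothetical)
    kind: str        # e.g. "coil" or "contact"
    relay: str = ""  # hypothetical field: the relay this component belongs to


def contacts_without_coil(components):
    """Return the names of contacts whose relay has no coil in the model."""
    relays_with_coil = {c.relay for c in components if c.kind == "coil"}
    return sorted(
        c.name
        for c in components
        if c.kind == "contact" and c.relay not in relays_with_coil
    )


model = [
    Component("K1", "coil", relay="R1"),
    Component("K1.a", "contact", relay="R1"),
    Component("K2.a", "contact", relay="R2"),  # R2 has no coil: reported
]
issues = contacts_without_coil(model)
```

Each reported name could then be used to move the focus to the offending
element, as described for the issue list.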


5 Compilation in Timed SMV 


An RRIS is stored internally as a set of bipartite graphs
$\{G_i(T, N, W)\}_{i<n}$, where each $G_i$ represents a circuit, i.e., a set of
component terminals ($T$) connected to junctures, or nodes, ($N$) by wired
edges ($W$).
The compilation converts such a description into a symbolic, infinite-state
timed automaton specified using (timed) SMV, the language of the NUXMV
model checker. The hierarchical structure of SMV modules enables us to produce 
specifications that directly reproduce the structure of the starting schematics; 
combined with a library of component models, we fully leverage the advantages 
of the compositional approach of Multi-Domain Switched Kirchhoff Networks 
(MDSKN) described in [8]. 


SMV library of components. The SMV library of components consists of 
41 different formal models. Most of them depend on one or more parameters, 
which are automatically instantiated at network generation time, based on the 
parameter choices in the RRIS. The values of such parameters are either supplied 
directly by the user or automatically selected according to the component role 
in the network. 

In some cases, different electrical components are mapped to the same SMV 
model; this happens when the differences between the components are not rel- 
evant for their electrical modeling (e.g. in case of manufacturing differences be- 
tween relays). 

Single-wired schemes are translated into equivalent double-wired schemes by 
a preliminary pass. After this, only double-wired components are considered, so 
that no connection is left implicit in the resulting formal model. 

The interface between components is realized by means of a special terminal 
module. This module defines a pair of electrical variables ii and vv representing, 
respectively, the current and the voltage at the terminal, corresponding to flow 
and effort in the MDSKN framework [8]. 

In addition to electrical connections, there are logical connections between 
components, e.g. between a relay R and its contacts. To handle this kind of 
connections, the models in the SMV library are divided in two classes: master 
and slave components. 

The SMV model for a master component exports the appropriate state vari- 
ables (e.g. the activation status of a relay) that trigger a corresponding action 
in its slaves. The SMV model for a slave component is characterized by the 
presence of one or more parameters, which play the role of external inputs for 
the component. The connection between these inputs and the correct master 
outputs is then resolved when the models are composed together to form the 
final network, as explained in the next subsection. 

Compared to the approach described in [8], which relied on a network of hybrid
automata supporting arbitrary continuous dynamics, here we consider networks
of timed automata (extended with real variables), whose only continuous
evolution is that of standard clocks of the form $\dot{c} = 1$. In fact, the
domain experts pointed out that a precise modeling of transient states (e.g. in
RC circuits implementing an activation delay for a relay) is not necessary for
the correct description of the RRIS and could be safely approximated with
timing constraints. Hence, we adopt a modeling style where the continuous
dynamics are replaced by a set of discrete transitions happening within a
constrained time interval. While sufficient for all practical purposes, this
style supports a more adequate synchronous composition and makes the
verification task easier.


Circuits composition. The SMV model C for a circuit G(T, N, W) is obtained 
as the synchronous composition of the models for its components, with additional 
constraints representing both wired and logical connections. 

For each wired juncture in $N$, connected to terminals $t_1, \ldots, t_k$, we
add to the invariant of the circuit the Kirchhoff conservation of current law
$t_1.ii + \cdots + t_k.ii = 0$, and the equality of potentials law
$t_1.vv = \cdots = t_k.vv$. Since all the components expose the same interface
of current and potential variables at the terminals, this composition step is
component-agnostic and localized to a single circuit.
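
The per-juncture constraints can be sketched as follows. This is a minimal
Python illustration that emits the two Kirchhoff laws as strings; the terminal
names and the exact textual syntax are assumptions made for the example and do
not claim to match what the NORMA compiler emits. Only the variable names `ii`
and `vv` come from the text.

```python
def juncture_constraints(terminals):
    """Build the invariant constraints (as strings) for one wired juncture."""
    if not terminals:
        return []
    # Conservation of current: the terminal currents sum to zero.
    current_law = " + ".join(f"{t}.ii" for t in terminals) + " = 0"
    # Equality of potentials: every terminal shares the voltage of the first.
    first = terminals[0]
    voltage_laws = [f"{first}.vv = {t}.vv" for t in terminals[1:]]
    return [current_law] + voltage_laws


# Hypothetical juncture connecting three terminals.
constraints = juncture_constraints(["t1", "t2", "t3"])
```

Because the function only looks at terminal names, it does not depend on the
type of the connected components, which mirrors the component-agnostic nature
of this composition step.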

Logical connections, instead, require resolving the correct binding between
master outputs and slave inputs, which corresponds to the configuration of the
graphical specifiers used. Then, the master output is passed as an input
parameter to the slave component’s module.

The high-level topology of the network is defined by a graph
$\mathcal{N} = (\{C_i\}, R)$ showing the logical connections between circuits.
Namely, an oriented edge $(C_i, C_j)$ belongs to $R$ iff there exists a
component in circuit $C_i$ which is the master of a component in circuit $C_j$.
The network topology may have cycles: a master component may be associated
with a slave in the same circuit, therefore inducing a self-loop in the graph
$\mathcal{N}$. In order to preserve the causality of the events, it is
important to model the remote action of a master on its input with a transition
in the SMV model. More specifically, in every master we should use an urgent
transition relation which delays the master output signal to the next state, in
which it will be actually read by the slave. The analysis of $\mathcal{N}$
allows for an optimization of the SMV module for the network, which aims to
shorten the paths of the resulting transition system. Namely, we insert the
minimum number of delayed masters needed to break cycles.
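
The text does not say how the minimum set of delayed masters is computed. As a
hedged illustration only, the Python sketch below finds such a set by brute
force on a small topology graph: it tries sets of masters of increasing size
and keeps the first one whose removal leaves the instantaneous connections
acyclic. The data layout (`links`), the names and the exhaustive search are all
assumptions; the search is exponential and is meant for tiny examples.

```python
from itertools import combinations


def has_cycle(edges):
    """Detect a directed cycle (self-loops included) with Kahn's algorithm."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    # Nodes that could never be removed lie on (or behind) a cycle.
    return removed != len(nodes)


def minimum_delayed_masters(links):
    """links maps a master name to its (master_circuit, slave_circuit) edges."""
    masters = sorted(links)
    for size in range(len(masters) + 1):
        for delayed in combinations(masters, size):
            instantaneous = [
                edge for m in masters if m not in delayed for edge in links[m]
            ]
            if not has_cycle(instantaneous):
                return set(delayed)
    return set(masters)  # not reached: delaying every master leaves no edges


# Hypothetical topology: R1/R2 form a cycle between C1 and C2, R3 is a self-loop.
links = {
    "R1": [("C1", "C2")],
    "R2": [("C2", "C1")],
    "R3": [("C3", "C3")],
    "R4": [("C1", "C3")],
}
delayed = minimum_delayed_masters(links)
```

In this example two delayed masters are needed: one to break the cycle between
the two circuits and one for the self-loop.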


Stubs and Assumptions. The RRIS network may have dangling inputs in the
switches which respond to the status of a lever or a button controlled by a human
operator. In order to have a self-contained closed system, we explicitly model
the environment module $\mathcal{E}$, which includes the models of the external
masters.

In order to support a localized analysis of the RRIS, focusing only on a subset 
of the circuits in the network and abstracting the others, NORMA supports the 
addition of a stub module S, which includes the masters belonging to removed 
circuits. 

Both the environment and the stub modules can be used to model assump- 
tions on the language of the inputs read by the network. As an example, when 
modeling the inputs coming from the trackside, we can add nominal or faulty 
assumptions on the behaviour of the signals. Such hypotheses can be provided 
by the user directly as SMV constraints. Furthermore, in order to ease the task
for railway experts (e.g. having to define sophisticated or repetitive
assumptions), NORMA also supports the automatic translation of such constraints
from standardized Excel spreadsheets.


6 Simplification of RRIS models 


NORMA contains a simplifier to reduce the number of variables, especially the 
real-valued ones, in the SMV models of the RRIS produced by the compiler. The 
steps of the simplifier are conservative: the optimized model is equivalent to the 
previous one with respect to the observable variables, i.e., the ones which can 
be included in a property to verify. 


6.1 Equivalence propagation 


Due to the compositional approach, the variables involved in the invariants of 
the circuits are highly redundant. Current and voltage variables are exposed by 
all component terminals and are strongly interconnected by Kirchhoff laws. For 
this reason, the first simplification step tries to reduce the number of
variables by inlining equivalences. Namely, we cluster the real variables into
equivalence classes, propagating the equivalences that can be inferred
syntactically from atoms of the form x = y or x + y = 0. Variables are
substituted with a unique representative element for their equivalence class,
and clustering is repeated until a fixpoint is reached.
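
One way to realize this clustering is a union-find structure that also records
a sign, so that atoms of the form x = y and x + y = 0 are handled uniformly.
The Python sketch below is an illustration under that assumption and is not
the NORMA implementation; the atom encoding (`"eq"`, `"opp"`) and the variable
names are hypothetical. It performs a single pass over the given atoms, whereas
the text describes repeating the step until a fixpoint, and it ignores the
degenerate case x = -x.

```python
class SignedUnionFind:
    """Union-find where each variable keeps a sign relative to its parent."""

    def __init__(self):
        self.parent = {}
        self.sign = {}  # invariant: x == sign[x] * parent[x]

    def find(self, x):
        """Return (representative, s) such that x == s * representative."""
        if x not in self.parent:
            self.parent[x], self.sign[x] = x, 1
        if self.parent[x] == x:
            return x, 1
        root, s = self.find(self.parent[x])
        # Path compression: point x directly at the root, composing the signs.
        self.parent[x], self.sign[x] = root, self.sign[x] * s
        return root, self.sign[x]

    def union(self, x, y, s):
        """Record the fact x == s * y, with s in {+1, -1}."""
        rx, sx = self.find(x)
        ry, sy = self.find(y)
        if rx != ry:
            # x = sx*rx and y = sy*ry, hence rx = (sx*s*sy) * ry.
            self.parent[rx], self.sign[rx] = ry, sx * s * sy


def propagate(atoms):
    """atoms: ("eq", x, y) encodes x = y; ("opp", x, y) encodes x + y = 0."""
    uf = SignedUnionFind()
    for kind, x, y in atoms:
        uf.union(x, y, 1 if kind == "eq" else -1)
    return {v: uf.find(v) for v in sorted(uf.parent)}


classes = propagate([
    ("eq", "a.vv", "b.vv"),   # equal potentials at a two-terminal juncture
    ("opp", "a.ii", "b.ii"),  # currents sum to zero at the same juncture
    ("eq", "b.ii", "c.ii"),
])
```

Each variable is mapped to a pair (representative, sign), so it can be
substituted by plus or minus its representative in the invariant.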


6.2 Abstracting electrical variables 


The variables occurring in the invariant of each module of the network (cir- 
cuits, environment or stub) can be classified exploiting the information about 
the topology as follows. 


— Input variables I: boolean variables possibly defined in other circuits used to 
model the open/closed condition of a switch. Each configuration of these
variables corresponds to a discrete mode of the circuit.

— State variables including: real-valued clock variables C, used to model timers 
inside components; history boolean variables H, needed in some components
to keep track of the previous state; all the real-valued electrical variables E
used for the values of current and voltage at each terminal.

— Output variables including: boolean variables representing the exposed mas- 
ter outputs Q, such as the status of a coil; real-valued probes P. Probes are 
added to the circuit to pin the points in which one would like to read the 
values of current or voltages. 


We simplify the model by removing the electrical variables which are only 
needed to establish a binding from the switches to the relays and probes. As 
a matter of fact, the input and output variables are the only observable ones, 
i.e., they can be used in the specifications to verify. This simplification step is 
based on the fact that electrical variables do not evolve during timed transitions. 
Pseudocode 1 Removal of electrical variables from the network
function REMOVE-ELECTRICAL-VARS($\mathcal{N} = (\{C_i\}, R)$)
    for all $C_i$ do
        classify vars in $Invar_i(I, C, H, E, Q, P)$
        $Invar_i$ := DETERMINIZE-VARS($Invar_i$);
        $Invar_i$ := QUANTIFY-ELECTRICAL-VARS($Invar_i$);

function DETERMINIZE-VARS($Invar(I, C, H, E, Q, P)$)
    if sat($Invar \land Invar[Q / Q'] \land (Q \neq Q')$) then    ▷ Check boolean outputs
        throw error: Q variables are non-deterministic.
    $\psi$ := $Invar \land Invar[E / E', P / P']$                   ▷ Check real vars
    for all $x \in (E \cup P)$ do
        if sat($\psi \land (x \neq x')$) then
            throw warning: variable x is non-deterministic
            if $x \in P$ then
                $Invar$ := $Invar \land$ GET-DEFAULT-CON($x$)
    return $Invar$

function QUANTIFY-ELECTRICAL-VARS($Invar$)
    $Def_C$ := $\{b_{a(c)} \leftrightarrow a(c) \mid c \in C,\ a(c) \in atoms(Invar)\}$
    $Def_P$ := $\{b_{(p=v)} \leftrightarrow (p = v) \mid p \in P,\ v \in$ GET-VALUES$(Invar, p)\}$
    $Invar'$ := $Invar \land Def_C \land Def_P$
    $\varphi$ := qelim($\exists E, C, P \,.\, Invar'$)
    $Invar(I, C, H, Q, P)$ := $\varphi[b_{(p=v)} / (p = v), \ldots, b_{a(c)} / a(c), \ldots]$
    return $Invar$


In fact, the continuous evolution of currents and voltages (e.g., the exponential 
dynamic of the charging process of a capacitor) is pragmatically abstracted in 
the modeling of the components using clock variables, which define lower and 
upper time bounds connecting two stationary conditions (e.g. no vs full charge). 
It follows that the electrical variables should be uniquely determined by the
discrete modes, and that the outputs (Q and P) are functions only of the inputs
I, clocks C and history variables H.

Function REMOVE-ELECTRICAL-VARS, reported in Pseudocode 1, first checks
and resolves the non-determinism of the outputs, then quantifies out the
electrical variables. Observe that circuits can be analyzed independently and in
parallel.


Solving non-determinism. In a purely theoretical setting, Kirchhoff laws
may actually allow unconstrained electrical variables. For example, current can
split non-deterministically in a branching node if the wires of a loop have
no load; similarly, the voltage of a terminal which is connected neither
to a ground nor to a power source is non-deterministic. Non-determinism of real
variables is an issue only if variables in P are affected, e.g. in case we verify
that an unpowered terminal always has a null value. Nonetheless, reporting all
the non-deterministic real-valued variables (including the ones in E) is a very
important check for the validation of the model: if many non-deterministic
variables are found in a circuit, then it is possible that a connection has been
missed during the manual modeling.

Function DETERMINIZE-VARS checks that the boolean outputs are uniquely 
defined and reports to the user the set of non-deterministic real-valued variables. 
For non-deterministic probes, it also enriches the invariant assigning a default 
unique value (e.g., the null value for potentials) for the configurations of inputs 
in which they are under-specified. 
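
The check on boolean outputs asks whether the invariant admits two different
output valuations for the same inputs. The Python sketch below illustrates this
idea by exhaustive enumeration on a toy invariant; it is a stand-in for the
satisfiability query of Pseudocode 1, not a reproduction of it. Real-valued
variables, clocks and history variables are left out, and the invariant, the
variable names and the function name are hypothetical.

```python
from itertools import product


def nondeterministic_inputs(invar, inputs, outputs):
    """Return the input configurations for which invar admits more than one
    valuation of the boolean outputs."""
    witnesses = []
    for in_vals in product([False, True], repeat=len(inputs)):
        env = dict(zip(inputs, in_vals))
        admitted = [
            out_vals
            for out_vals in product([False, True], repeat=len(outputs))
            if invar({**env, **dict(zip(outputs, out_vals))})
        ]
        if len(admitted) > 1:
            witnesses.append(env)
    return witnesses


def toy_invar(v):
    """Toy invariant: output q follows switch s1 only while s2 is closed."""
    return (not v["s2"]) or (v["q"] == v["s1"])


found = nondeterministic_inputs(toy_invar, ["s1", "s2"], ["q"])
```

Here the output is unconstrained whenever `s2` is open, so both configurations
with `s2 = False` are reported; in a real model such a report would hint at a
missing connection or a missing default value.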


Removing electrical variables. We want to compute
$Invar(I, C, H, Q, P) = \exists E \,.\, Invar$, where the real-valued variables
C and P are preserved. In order to avoid a possibly expensive geometric
projection [14], function QUANTIFY-ELECTRICAL-VARS initially builds a boolean
encoding for them, which enables the use of more efficient All-SMT quantifier
elimination [13].

For clock variables we add a boolean variable $b_{a(c)}$ for every linear
constraint $a(c) \in atoms(Invar)$ involving some $c \in C$.

Probe variables, instead, occur in atoms mixed with other electrical variables
that are to be quantified. Thanks to the previous determinization step, we know
that each variable $p \in P$ can assume a finite number of values, induced by
the finite configurations of the inputs. Therefore, a boolean encoding for p is
obtained by enumerating the values that it can have in Invar with function
GET-VALUES.

After the projection is performed on a purely boolean target, the original C 
and P are recovered by substituting the boolean hooks with the corresponding 
atoms. 


7 Software architecture 


At system level NORMA interacts with several entities (Fig. 4). NORMA is built
on top of DIA [2], a mature open source program for drawing diagrams. To
handle the interactions with the remote repository, NORMA uses GIT [1] and
GITLAB [4]. To perform the SMV model simplification task NORMA uses NUXMV [7]
and PYSMT [10]. The traceability block is implemented by using POPPLER [3]
as backend for visualizing and annotating PDF documents. The domain-specific
palette is automatically generated out of a set of tables filled in by the
domain experts. These tables contain logical constraints over base components
and specifiers, such as their graphical aspects, compatibility matrices, physical
properties, admissible terminal configurations, orientation constraints, etc. Base
components and specifiers are combined into a generated palette of over five
thousand component types that can be directly read by DIA.

NORMA is the result of about 2 years of development, with a team of 2 to 4
person/year. Extensions made on top of DIA are implemented in C and Python,
for a total of about 20KLOC. The compiler is implemented in Python and counts
about 10KLOC. NORMA is developed with an agile process, adopting continuous
integration, feedback and testing from domain experts, and formal verification of
the model library. NORMA is proprietary software and is currently not licensed
to third parties.


8 Experimental Evaluation 


NORMA has been used for several months by a team of 3 domain experts to 
model several schematics of the Italian railway logic. For this experimental eval- 
uation, we consider two reference schemas: r-switch and routes48. r-switch 
is a complete RRIS controlling a railway switch; it represents a general
network which is replicated and connected to other schemas. routes48 describes
the route formation for a medium-sized interlocking station, with 48 shunting
routes. Each route is associated with a button and can be enabled/disabled by a
human operator.


Modeling support. The considered schemas include thousands of interconnected
components and several units. The modeling activity in NORMA took
approximately 1 person/week for each RRIS. Several further iterations were
required. The checkers implemented within the graphical interface were able to
immediately report syntactic errors to the user, such as dangling terminals or
misspelled names. Subsequent analyses, such as the classification of the circuit
variables and the checks of deterministic outputs, pinpointed several errors
that had been missed by the syntactic checks: swaps of identifiers between
components (resulting in wrong logical connections between masters and slaves),
missing connections in n-ary junctures, and wrong initial conditions for
switches. The fixes were validated by examining simulations and by verifying
basic properties on the corresponding SMV model.

Stubs and assumptions also proved very useful in modeling. RRIS routes48 
is connected to 11 railway switches, which have been excluded from the model- 
ing and abstracted by a stub. Assumptions on their behaviour were expressed in 
tabular format, and automatically imported in NORMA by way of a dedicated 
conversion module. Assumptions were also used to specify typical scenarios con- 
straining the actions of the human operator. RRIS r-switch is attached to a 
stub which abstracts the behaviour of some physical entities in both nominal 
and faulty modalities. 
Table 1. Effects of the simplifications on the number of real-valued variables and
times (in seconds) spent in each phase.

                         compiler                  load        inliner      E removal
RRIS       comps  circuits  #bools  #reals  time   time   #reals  time   #reals  time
r-switch     185        16      94    1362     2      7      325     6        6    26
routes48     691        13     580    8397    14    205     1467   188     1170    21
routes02      51         7      40     642     2      2      108     2        5     1
routes04      95         8      74    1220     3      5      214     5        8     4
routes06     149         9     102    1935     3     12      337    11      124    12
routes12     210         9     139    2629     3     20      473    19      347     3


Effectiveness of Simplifications. In Table 1 we evaluate the impact of the sim- 
plification steps on the analyzed RRIS, including a set of handcrafted scaled 
versions of routes48, which control a reduced number of routes. 

Column “compiler” shows the features of the obtained SMV models in terms
of the number of components, circuits, boolean and real-valued variables,
together with the time spent for the compilation. Column “load” reports the time
spent for the untiming (with timed NUXMV [9]) and the conversion to PYSMT
formulae. Column “inliner” shows the effects of the propagation of equivalences,
which drastically reduces the number of real variables. Column “E removal”
corresponds to both the determinization and the quantification of the real
variables. While the determinization check is always performed, in these
experiments we heuristically enable the removal of electrical variables
circuit-wise, depending on the number of input variables. As a result, the
unneeded electrical variables were fully removed only in the smaller circuits.
In RRIS such as r-switch, routes02 and routes04, we obtained a simplified model
where the remaining real-valued variables are only clocks and probes. Finally,
observe that the reported time for these simplification steps corresponds to the
sequential analysis of the circuits, which are independent of each other and
could be analyzed in parallel. Even without parallelization, the performance of
the tool was considered to be adequate, given the strong support in
pinpointing modeling errors.
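
To make the reduction reported in Table 1 easier to read, the short Python
sketch below computes, for each RRIS, the percentage of real-valued variables
removed after inlining and after the removal of electrical variables. The
numbers are copied from the #reals columns of Table 1; the helper names are
only illustrative.

```python
# (RRIS, #reals after compilation, after inlining, after E removal), from Table 1.
TABLE_1 = [
    ("r-switch", 1362, 325, 6),
    ("routes48", 8397, 1467, 1170),
    ("routes02", 642, 108, 5),
    ("routes04", 1220, 214, 8),
    ("routes06", 1935, 337, 124),
    ("routes12", 2629, 473, 347),
]


def reduction(before, after):
    """Percentage of variables removed going from `before` to `after`."""
    return 100.0 * (before - after) / before


summary = {
    name: (
        round(reduction(compiled, inlined), 1),
        round(reduction(compiled, removed), 1),
    )
    for name, compiled, inlined, removed in TABLE_1
}

for name, (after_inliner, after_removal) in summary.items():
    print(f"{name:9s} inliner: {after_inliner:5.1f}%  E removal: {after_removal:5.1f}%")
```

On these figures the inliner alone removes between roughly 76% and 83% of the
real-valued variables in every benchmark.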


Verification. We verified the obtained SMV models against a set of
domain-dependent specifications. For RRIS r-switch we consider 16 safety
properties describing how the switch changes in response to commands, for both
the nominal and faulty stub modalities. RRIS routes48 implements a controlling
logic which prevents two incompatible routes from being enabled simultaneously:
incompatibilities are checked with a system of lockings of the railway switches
shared by concurrent routes. An incompatibility table represents which pairs of
routes can be sequentially activated. The table is not symmetric, as the
incompatibility relation depends on the activation order. Thanks to the
simplifications, we were able to model check all the 2256 entries, proving both
safety (incompatible pairs) and liveness (compatible pairs) properties.
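
The figure of 2256 entries is consistent with one table entry per ordered pair
of distinct routes among the 48 routes, since 48 × 47 = 2256. This reading is
an assumption, as the text does not spell out how the entries are counted; the
Python lines below simply enumerate such pairs.

```python
from itertools import permutations

routes = range(48)  # the 48 shunting routes of routes48
# Ordered pairs (first activated, second activated) of distinct routes.
ordered_pairs = list(permutations(routes, 2))
entry_count = len(ordered_pairs)
```

Counting ordered rather than unordered pairs matches the remark that the table
is not symmetric.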

Fig. 5 shows the impact of the simplification steps on the model checking
times for the verification of r-switch (on the left-hand side) and routes48 (on
the right-hand side). For the latter, we consider 105 queries tackled with IC3
and 91 queries tackled with BMC under a 1 hour timeout. Only 17 IC3 instances
hit this threshold with the simplified model, against the 50 timeouts
of the original model, for both IC3 and BMC.

Fig. 5. Effect of the simplifications on the model checking times.


9 Conclusions 


In this paper we presented NORMA, a tool for the modeling and formal analysis
of relay-based railway interlocking systems (RRIS). NORMA allows the graphical
representation of the wide class of RRIS of the Italian railway network, and
provides various checks to ease the task of the modeler. Furthermore, it
provides an optimized compilation to the input language of timed NUXMV, which
converts the RRIS into a symbolic infinite-state timed transition system. This
enables simulation and effective model checking of temporal properties. The
experimental results clearly demonstrate the effectiveness of the simplification
techniques. The tool is also shown to provide strong feedback to the user to
support debugging in the modeling process.

NORMA is being extensively used within RFI. Despite the support provided
to the modelers, the sheer size of real-world RRIS results in a very high human
modeling effort. We are currently experimenting with deep learning techniques
to automate, at least partially, the modeling step. In this setting, the formal
analysis capabilities will be fundamental to detect misclassified samples. In the
future, we will work on obtaining high-coverage test suites for RRIS, to improve
testing of computer-based RIS. We will also explore compositional
contract-based reasoning to reduce the computation time of model checking, and
provide clear interfaces between RRIS modules. A tighter integration of the
simulation capabilities within the tool front-end is also planned.
Abstract. Inspired by sum-of-infeasibilities methods in convex optimization,
we propose a novel procedure for analyzing verification queries on
neural networks with piecewise-linear activation functions. Given a convex
relaxation which over-approximates the non-convex activation functions,
we encode the violations of activation functions as a cost function and
optimize it with respect to the convex relaxation. The cost function,
referred to as the Sum-of-Infeasibilities (SoI), is designed so that its
minimum is zero and achieved only if all the activation functions are satisfied.
We propose a stochastic procedure, DeepSoI, to efficiently minimize the
SoI. An extension to a canonical case-analysis-based complete search
procedure can be achieved by replacing the convex procedure executed
at each search state with DeepSoI. Extending the complete search with
DeepSoI achieves multiple simultaneous goals: 1) it guides the search
towards a counter-example; 2) it enables more informed branching
decisions; and 3) it creates additional opportunities for bound derivation. An
extensive evaluation across different benchmarks and solvers demonstrates
the benefit of the proposed techniques. In particular, we demonstrate
that SoI significantly improves the performance of an existing complete
search procedure. Moreover, the SoI-based implementation outperforms
other state-of-the-art complete verifiers. We also show that our technique
can efficiently improve upon the perturbation bound derived by a recent
adversarial attack algorithm.


Keywords: neural networks · sum of infeasibilities · convex optimization
· stochastic local search.


1 Introduction 


Neural networks have become state-of-the-art solutions in various application
domains, e.g., face recognition, voice recognition, game-playing, and automated
piloting [47,30,55,7]. While generally successful, neural networks are known to be
susceptible to input perturbations that humans are naturally invariant to [61,41].
This calls the trustworthiness of neural networks into question, particularly in
safety-critical domains.

In recent years, there has been a growing interest in applying formal methods
to neural networks to analyze certain robustness or safety specifications [43]. Such
specifications are often defined by a collection of partial input/output relations:
e.g., the network uniformly and correctly classifies inputs within a certain distance
(in some $l_p$ norm) of a selection of input points. The goal of formal verification
is to either prove that the network meets the specification or to disprove it by
constructing a counter-example.


Most standard activation functions in neural networks are non-linear, making 
them challenging to reason about. Consider the rectified linear unit (ReLU): if a 
ReLU can take both positive and negative inputs, a verifier will typically need 
to consider, separately, each of these two activation phases. Naive case analysis 
requires exploring a number of combinations that is exponential in the number 
of ReLUs, which quickly becomes computationally infeasible for large networks. 
To mitigate this complexity, neural network verifiers typically operate on convex 
relaxations of the activation functions. The relaxed problem can often be solved 
with an efficient convex procedure, such as Simplex [35,23] or (sub-)gradient 
methods [51,21]. Due to the relaxation, however, a solution may be inconsistent 
with the true activation functions. When this happens, the convex procedure 
cannot make further progress on its own. For this reason, to ensure completeness, 
the convex procedure is typically embedded in an exhaustive search shell, which 
encodes the activation functions explicitly and branches on them when needed. 
While the exhaustive search ensures progress, it also brings back the problem 
of combinatorial explosion. This raises the key question: can we guide the 
convex procedure to satisfy the activation functions without explicitly 
encoding them? 


In convex optimization, the sum-of-infeasibilities (SoI) [10] function measures
the error (with respect to variable bounds) of a variable assignment. Minimizing
the SoI naturally guides the procedure to a satisfying assignment. In this paper,
we extend this idea to instead represent the error in the non-linear activation
functions. The goal is to “softly” guide the search over the relaxed problem using
information about the precise activation functions. If an assignment is found
for which the SoI is zero, then not only is the assignment a solution for the
relaxation, but it also solves the precise problem. Encoding the SoI w.r.t. the
piecewise-linear activation functions yields a concave piecewise-linear function,
which is challenging to minimize directly. Instead, we propose to minimize the SoI
for individual activation patterns and reduce the SoI minimization to a stochastic
search for the activation pattern where the SoI is minimal. The advantage is that
for each activation pattern, the SoI collapses into a linear cost function, which
can be easily handled by a convex solver. We introduce a specialized procedure,
DeepSoI, which uses Markov chain Monte Carlo (MCMC) search to efficiently
navigate towards activation patterns at the global minimum of the SoI. If the
minimal SoI is ever zero for an activation pattern, then a solution has been found.
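
The interplay between the concave cost and its per-pattern linear pieces can be
illustrated on a toy example. The Python sketch below assumes, as an
illustration not stated in this section, that each ReLU with pre-activation b
and post-activation f is relaxed by f ≥ b and f ≥ 0, and that its violation is
measured by min(f − b, f). Under that assumption the sum is zero exactly when
every pair satisfies f = max(b, 0), and fixing a phase per ReLU turns the cost
into a sum of linear terms. The function names and the sample point are
hypothetical, and the MCMC search itself is not shown.

```python
from itertools import product


def soi(assignment):
    """Sum over ReLUs of min(f - b, f), assuming f >= b and f >= 0 hold."""
    return sum(min(f - b, f) for b, f in assignment)


def phase_cost(assignment, pattern):
    """Linear cost for a fixed activation pattern:
    an 'active' ReLU contributes f - b, an 'inactive' one contributes f."""
    return sum(
        (f - b) if phase == "active" else f
        for (b, f), phase in zip(assignment, pattern)
    )


# (b, f) values of three ReLUs at some point of the relaxation.
point = [(1.0, 1.0), (-2.0, 0.5), (0.5, 2.0)]
costs = {
    pattern: phase_cost(point, pattern)
    for pattern in product(["active", "inactive"], repeat=len(point))
}
best_pattern = min(costs, key=costs.get)
```

At any fixed point the concave cost equals the smallest of the per-pattern
linear costs, which is why searching over activation patterns and minimizing a
linear cost for each one is a way to minimize the overall function.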


An extension to a canonical complete search procedure can be achieved
by replacing the convex procedure call at each search state with the DeepSoI
procedure. Since the SoI contains additional information about the problem, we
propose a novel SoI-aware branching heuristic based on the estimated impact
of each activation function on the SoI. Finally, DeepSoI naturally preserves new
bounds derived during the execution of the underlying convex procedure (e.g.,
Simplex), which further prunes the search space in the complete search. For
simplicity, we focus on ReLU activation functions in this paper, though the
proposed approach can be applied to any piecewise-linear activation function.

We implemented the proposed techniques in the Marabou framework for 
Neural Network Analysis [36] and performed an extensive performance evaluation 
on a wide range of benchmarks. We compare against multiple baselines and 
show that extending a complete search procedure with our SoI-based techniques
results in significant overall speed-ups. Finally, we present an interesting use
case for our procedure: efficiently improving the perturbation bounds found by
AutoAttack [17], a state-of-the-art adversarial attack algorithm.

To summarize, the contributions of the paper are: (i) a technique for guiding
a convex solver with an SoI function w.r.t. the activation functions; (ii) DeepSoI,
a procedure for minimizing the non-linear SoI via the interleaved use of an
MCMC sampler and a convex solver; (iii) an SoI-aware branching heuristic, which
complements the integration of DeepSoI into a case-analysis based search shell;
and (iv) a thorough evaluation of the proposed techniques.

The rest of the paper is organized as follows. Section 2 presents an overview of
related work. Section 3 introduces preliminaries. Section 4 introduces the SoI and
proposes a solution for its minimization. Section 5 presents the analysis procedure
DeepSoI, its use in the complete verification setting, and an SoI-aware branching
heuristic. Section 6 presents an extensive experimental evaluation. Conclusions
and future work are in Section 7.


2 Related Work 


Approaches to complete analysis of neural networks can be divided into SMT- 
based [35,36,23], reachability-analysis based [5,64,65,29,25], and the more general 
branch-and-bound approaches [1,63,24,44,13,37,9]. As mentioned in [14], these 
approaches are related, and differ primarily in their techniques for bounding and 
branching. Given the computational complexity of neural network verification, a 
diverse set of research directions aims to improve performance in practice. Many 
approaches prune the search space using tighter convex relaxations and bound infe- 
rence techniques [64,23,31,58,56,45,76,70,67,66,20,63,69,52,62,51,73,26,68,59,8,57]. 
Another direction leverages parallelism by exploiting independent structures in 
the search space [48,75,71]. Different encodings of the neural network verification 
problems have also been studied: e.g., as MILP problems that can be tackled by 
off-the-shelf solvers [63,2], or as dual problems admitting efficient GPU-based 
algorithms [12,21,22,19]. DeepSoI can be instantiated with any sound convex 
relaxations and matching convex procedures. It can also be installed in any case- 
analysis-based complete search shell, therefore integrating easily with existing 
parallelization techniques, bound-tightening passes, and branching heuristics. 
Two approaches most relevant to our work are Reluplex [35] and PeregriNN [37].
Reluplex invokes an LP solver to solve the relaxed problem, and then
updates its solution to satisfy the violated activation functions, with the hope
of nudging the produced solutions towards a satisfying assignment. However,
the updated solution by Reluplex could violate the linear relaxation, leading to
non-convergent cycling between solution updates and LP solver calls, which can
only be broken by branching. In contrast, our approach uses information about
the precise activation functions to actively guide the convex solver. Furthermore,
in the limit DeepSoI converges to a solution (if one exists). PeregriNN also uses
an objective function to guide the solving of the convex relaxation. However,
their objective function approximates the ReLU violation and does not guarantee
a real counter-example when the minimum is reached. In contrast, the SoI
function captures the exact ReLU violation, and if a zero-valued point is found,
it is guaranteed to be a real counter-example. We compare our techniques to
PeregriNN in Section 6.

We use MCMC-sampling combined with a convex procedure to minimize the 
concave piecewise-linear SoI function. MCMC-sampling is a common approach for
stochastically minimizing irregular cost functions that are not amenable to exact 
optimization techniques [32,53,3]. Other stochastic local search techniques [54,27] 
could also be used for this task. However, we chose MCMC because it is adept at 
escaping local optima, and in the limit, it samples more frequently the region 
around the optimum value. As one point of comparison, in Section 6, we compare 
MCMC-sampling with a Walksat-based [54] local search strategy. 


3 Preliminaries 


Neural Networks. We define a feed-forward, convolutional, or residual neural
network with $k + 1$ layers as a set of neurons $N$, topologically ordered into
layers $L_0, \ldots, L_k$, where $L_0$ is the input layer and $L_k$ is the output layer.
Given $n_i, n_j \in N$, we use $n_i \prec n_j$ to denote that the layer of $n_i$ precedes the
layer of $n_j$. The value of a neuron $n_i \in N \setminus L_0$ is computed as
$act_i(b_i + \sum_{n_j \prec n_i} w_{ij} \cdot n_j)$, an affine transformation of the preceding
neurons followed by an activation function $act_i$. We use $n_i^b$ and $n_i^a$ to represent
the pre- and post-activation values of such a neuron: $n_i^a = act_i(n_i^b)$. For
$n_i \in L_0$, $n_i^b$ is undefined and we assume $n_i^a$ can take any value. In this paper,
we focus on ReLU neural networks. That is, $act_i$ is the ReLU function
($\mathrm{ReLU}(x) = \max(0, x)$) unless $n_i$ belongs to the output layer $L_k$, in which
case $act_i$ is the identity function. We use $R(N)$ to denote the set of ReLU neurons
in $N$. An activation pattern is defined by choosing a particular phase (either
active or inactive) for every $n \in R(N)$ (i.e., choosing either $n_i^b < 0$ or
$n_i^b \geq 0$ for each $n_i \in R(N)$).
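The notation above can be made concrete with a small sketch. The following Python snippet evaluates a fully connected ReLU network layer by layer and reads off the activation pattern. It is illustrative only: the names `forward` and `activation_pattern` and the nested-list weight format are assumptions made for this example (not part of any verifier's API), and it covers only the simple case where each layer depends on the layer directly before it.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def forward(weights: List[Matrix], biases: List[Vector],
            x: Vector) -> Tuple[List[Vector], List[Vector]]:
    """Return per-layer pre-activation (n^b) and post-activation (n^a) values."""
    pre: List[Vector] = []
    post: List[Vector] = []
    current = x  # post-activation values of the input layer L_0
    last = len(weights) - 1
    for layer, (W, b) in enumerate(zip(weights, biases)):
        # Affine transformation: n_i^b = b_i + sum_j w_ij * n_j^a
        nb = [b_i + sum(w * v for w, v in zip(row, current))
              for row, b_i in zip(W, b)]
        # Hidden layers apply ReLU; the output layer L_k uses the identity.
        na = list(nb) if layer == last else [max(0.0, v) for v in nb]
        pre.append(nb)
        post.append(na)
        current = na
    return pre, post


def activation_pattern(pre: List[Vector]) -> List[List[bool]]:
    """Phase of each hidden ReLU: True if active (n^b >= 0), False if inactive."""
    return [[v >= 0 for v in layer] for layer in pre[:-1]]
```

For a network with $|R(N)|$ hidden ReLUs there are $2^{|R(N)|}$ possible outputs of `activation_pattern`, which is the source of the combinatorial explosion discussed in the introduction.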

Neural Network Verification as Satisfiability. Consider the verification of
a property $P$ over a neural network $N$. The property $P$ has the form
$P_{in} \Rightarrow P_{out}$, where $P_{in}$ and $P_{out}$ constrain the input and output layers,
respectively. $P$ states that for each input point satisfying $P_{in}$, the output layer
satisfies $P_{out}$. To formalize the verification problem, we first define the set of
variables in a neural network $N$, denoted as $Var(N)$, to be
$\bigcup_{n_i \in N \setminus L_k} \{n_i^a\} \cup \bigcup_{n_i \in N \setminus L_0} \{n_i^b\}$. We define
a variable assignment, $\alpha : Var(N) \to \mathbb{R}$, to be a mapping from variables in $N$ to
real values. The verification task thus can be formally stated as finding a variable
assignment $\alpha$ that satisfies the following set of constraints over $Var(N)$ (denoted
as $\phi$):³

$$ \forall n_i \in N \setminus L_0,\quad n_i^b = b_i + \sum_{n_j \prec n_i} w_{ij} \cdot n_j^a \tag{1a} $$
$$ \forall n_i \in R(N),\quad n_i^a = act_i(n_i^b) \tag{1b} $$
$$ P_{in} \land \lnot P_{out} \tag{1c} $$


If such an assignment $\alpha$ exists, we say that $\phi$ is satisfiable and can conclude
that $P$ does not hold, as from $\alpha$ we can retrieve an input $x \in P_{in}$, such that
the neural network’s output violates $P_{out}$. If such an $\alpha$ does not exist, we say
$\phi$ is unsatisfiable and can conclude that $P$ holds. We use $\alpha \models \phi$ to denote that
an assignment $\alpha$ satisfies $\phi$. In short, verifying whether $P$ holds on a neural
network $N$ boils down to deciding the satisfiability of $\phi$. We also refer to $\phi$ as
the verification query in this paper.

Convex Relaxation of Neural Networks. Deciding whether $P$ holds on a
ReLU network $N$ is NP-complete [35]. To curb intractability, many verifiers
consider the convex (e.g., linear, semi-definite) relaxation of the verification
problem, sacrificing completeness in exchange for a reduction in the computational
complexity. We use $\tilde{\phi}$ to denote the convex relaxation of the exact problem $\phi$.
If $\tilde{\phi}$ is unsatisfiable, then $\phi$ is also unsatisfiable, and property $P$ holds. If the
convex relaxation is satisfiable with satisfying assignment $\alpha$ and $\alpha$ also satisfies
$\phi$, then $P$ does not hold.

In this paper, we use the Planet relaxation introduced in [23]. It is a linear
relaxation, illustrated in Figure 1. Each ReLU constraint $\mathrm{ReLU}(n^b) = n^a$ is
over-approximated by three linear constraints: $n^a \geq 0$, $n^a \geq n^b$, and
$n^a \leq \frac{u}{u - l} n^b - \frac{u \cdot l}{u - l}$, where $u$ and $l$ are the upper and lower
bounds of $n^b$, respectively (which can be derived using bound-tightening techniques
such as those in [67,58,76]). If Constraint 1c is also linear, the convex relaxation
$\tilde{\phi}$ is a Linear Program, whose satisfiability can be decided efficiently (e.g.,
using the Simplex algorithm [18]).
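As a concrete reading of the three linear constraints, the sketch below checks whether a point $(n^b, n^a)$ lies inside the triangular relaxation of a single ReLU with bounds $l < 0 < u$. The function names `planet_upper` and `satisfies_planet` and the tolerance parameter are assumptions introduced for this illustration.

```python
def planet_upper(nb: float, l: float, u: float) -> float:
    """Upper edge of the triangle: u/(u-l) * nb - u*l/(u-l) = u*(nb - l)/(u - l)."""
    return u * (nb - l) / (u - l)


def satisfies_planet(nb: float, na: float, l: float, u: float,
                     tol: float = 1e-9) -> bool:
    """Check n^a >= 0, n^a >= n^b, and n^a <= upper edge for an unstable ReLU."""
    if not (l < 0 < u):
        raise ValueError("the triangle relaxation applies only when l < 0 < u")
    return (na >= -tol
            and na >= nb - tol
            and na <= planet_upper(nb, l, u) + tol)
```

With $l = -1$ and $u = 1$, the point $(n^b, n^a) = (0, 0.5)$ satisfies the relaxation even though $\mathrm{ReLU}(0) = 0 \neq 0.5$. Such points are exactly the spurious solutions that the relaxation admits and the exact problem rejects.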

Sum-of-Infeasibilities. In convex optimization [10,39], the sum-of-infeasibilities
(SoI) method can be used to direct the feasibility search. The satisfiability of
a formula $\phi$ is cast as an optimization problem, with an objective function $f$
representing the total error (i.e., the sum of the distances from each out-of-bounds
variable to its closest bound). The lower bound of $f$ is 0 and is achieved
only if $\phi$ is satisfiable. In our context, we use a similar function $f_{soi}$, but with
the difference that it represents the total error of the ReLU constraints in $\phi$. In
our case, $f_{soi}$ is non-convex, and thus a more sophisticated approach is needed
to minimize it efficiently.


Fig. 1: The Planet relaxation. 


3 The verification can also be equivalently viewed as an optimization problem [14]. 
Complete Analysis via Exhaustive Search. One common approach for
complete verification involves constructing a search tree and calling a convex
procedure SOLVECONV at each tree node, as shown in Algorithm 1. SOLVECONV
solves the convex relaxation $\tilde{\phi}$ and returns a pair $r, \alpha$ where either: 1) $r = \text{SAT}$
and $\alpha \models \tilde{\phi}$; or 2) $r = \text{UNSAT}$ and $\tilde{\phi}$ is unsatisfiable. If $\tilde{\phi}$ is unsatisfiable or $\alpha$
also satisfies $\phi$, then the result for $\tilde{\phi}$ also holds for $\phi$ and is returned. Otherwise,
the search space is divided further using BRANCH, which returns a set $\Psi$ of
sub-problems such that $\phi$ and $\bigvee \Psi$ are equisatisfiable.

Before invoking SOLVECONV to solve $\tilde{\phi}$, it is common to first call an efficient
bound-tightening procedure (TIGHTENBOUNDS) to prune the search space or even
derive UNSAT preemptively. This TIGHTENBOUNDS procedure can be instantiated
in various ways, including with analyses based on LiRPA [74,76,58,70], kReLU [56],
or PRIMA [49]. In addition to the dedicated bound-tightening pass, some
convex procedures (e.g., Simplex) also naturally lend themselves to bound
inference during their executions [38,35]. The overall performance of Algorithm 1
depends on the efficacy of bound-tightening, the branching heuristics, and the
underlying convex procedure.

Algorithm 1 Complete search.
Input: a verification query $\phi$.
Output: SAT/UNSAT
function COMPLETESEARCH($\phi$)
    $\tilde{\phi} \leftarrow$ TIGHTENBOUNDS($\tilde{\phi}$)
    $r, \alpha \leftarrow$ SOLVECONV($\tilde{\phi}$)
    if $r = \text{UNSAT} \lor \alpha \models \phi$ then
        return $r$
    for $\phi_i \in$ BRANCH($\phi$) do
        if COMPLETESEARCH($\phi_i$) = SAT then
            return SAT
    return UNSAT

Adversarial attacks. Adversarial attacks [61,46,28,15] are another approach 
for assessing neural network robustness. While verification uses exhaustive search 
to either prove or disprove a particular specification, adversarial attacks focus on 
efficient heuristic algorithms for the latter. From another perspective, they can 
demonstrate upper bounds on neural network robustness. In Section 6, we show 
that our analysis procedure can improve the bounds found by AutoAttack [17]. 


4 Sum of Infeasibilities in Neural Network Analysis 


In this section, we introduce our SoI function, consider the challenge of its
minimization, and present a stochastic local search solution.


4.1 The Sum of Infeasibilities 


As mentioned above, in convex optimization, an SoI function represents the sum
of errors in a candidate variable assignment. Here, we build on this idea by
introducing a cost function $f_{soi}$, which computes the sum of errors introduced
by a convex relaxation of a verification query. We aim to use $f_{soi}$ to reduce the
satisfiability problem for $\phi$ to a simpler optimization problem. We will need the
following property to hold.


Condition 1. For an assignment $\alpha$, $\alpha \models \phi$ iff $\alpha \models \tilde{\phi} \land f_{soi} \leq 0$.




If Condition 1 is met, then satisfiability of $\phi$ reduces to the following
minimization problem:

$$ \begin{aligned} &\text{minimize} && f_{soi} \\ &\text{subject to} && \tilde{\phi} \end{aligned} \tag{2} $$


To formulate the SoI for ReLU networks, we first define the error in a ReLU
constraint $n$ as:

$$ E(n) = \min(n^a - n^b,\ n^a) \tag{3} $$

The two arguments correspond to the error when the ReLU is in the active and
inactive phase, respectively. Recall that the Planet relaxation constrains $(n^b, n^a)$
in the triangular area in Figure 1, where $n^a \geq n^b$ and $n^a \geq 0$. Thus, the minimum
of $E(n)$ subject to $\tilde{\phi}$ is non-negative, and furthermore, $E(n) = 0$ iff the ReLU
constraint $n$ is satisfied (this is also true for any relaxation at least as tight as the
Planet relaxation). We now define $f_{soi}$ as the sum of errors in individual ReLUs:

$$ f_{soi} = \sum_{n \in R(N)} E(n) \tag{4} $$


Theorem 1. Let $N$ be a set of neurons for a neural network, $\phi$ a verification
query (an instance of (1)), and $\tilde{\phi}$ the Planet relaxation of $\phi$. Then $f_{soi}$ as given
by (4) satisfies Condition 1.


Proof. It is straightforward to show that $f_{soi}$ subject to $\tilde{\phi}$ is non-negative and is
zero if and only if each $E(n_i)$ is zero. That is, $\min f_{soi}$ subject to $\tilde{\phi}$ is zero if and
only if all ReLUs are satisfied. Therefore, if $\alpha$ satisfies $\phi$, then $\alpha \models f_{soi} = 0$. On
the other hand, since an assignment $\alpha$ that satisfies $\tilde{\phi}$ can only violate the ReLU
constraints in $\phi$, if $\alpha \models f_{soi} = 0$, then all the constraints in $\phi$ must be satisfied,
i.e., $\alpha \models \phi$.
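A short numerical sketch may help in reading definitions (3) and (4). The snippet evaluates the per-ReLU error and the SoI at a given assignment, represented here as a list of (pre-activation, post-activation) pairs. That representation and the names `relu_error` and `f_soi` are assumptions for this example, and the assignment is assumed to already lie inside the Planet relaxation.

```python
from typing import List, Tuple


def relu_error(nb: float, na: float) -> float:
    """E(n) = min(n^a - n^b, n^a): distance to the active or the inactive phase."""
    return min(na - nb, na)


def f_soi(assignment: List[Tuple[float, float]]) -> float:
    """Sum of the ReLU errors over all (n^b, n^a) pairs in the assignment."""
    return sum(relu_error(nb, na) for nb, na in assignment)
```

For example, the assignment `[(1.0, 1.0), (-2.0, 0.0)]` satisfies both ReLU constraints and has SoI 0, whereas `[(1.0, 1.5), (-1.0, 0.25)]` has SoI 0.5 + 0.25 = 0.75 and therefore violates at least one ReLU constraint.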


Note that the error $E$, and its extension to SoI, can easily be defined for
other piecewise-linear functions besides ReLU. We now turn to the question of
minimizing $f_{soi}$. Observe that

$$ \min f_{soi} = \min \sum_{n \in R(N)} E(n) = \min\left(\left\{ f \;\middle|\; f = \sum_{n_i \in R(N)} t_i,\ t_i \in \{n_i^a - n_i^b,\ n_i^a\} \right\}\right). \tag{5} $$


Thus, $f_{soi}$ is the minimum over a set, which we will denote $S_{soi}$, of linear
functions. Although $\min f_{soi}$ cannot be used directly as an objective in a convex
procedure, we could minimize each individual linear function $f \in S_{soi}$ with a
convex procedure and then keep the minimum over all functions. We refer to
the functions in $S_{soi}$ as phase patterns of $f_{soi}$. For notational convenience, we
define $cost(f, \tilde{\phi})$ to be the minimum of $f$ subject to $\tilde{\phi}$. The minimization problem
(2) can thus be restated as searching for the phase pattern $f \in S_{soi}$, where
$cost(f, \tilde{\phi})$ is minimal. Note that for a particular activation pattern, $f_{soi} = f$ for
some $f \in S_{soi}$. From this perspective, searching for the $f \in S_{soi}$ where $cost(f, \tilde{\phi})$
is minimal can also be viewed as searching for the activation pattern where the
global minimum of $f_{soi}$ is achieved.
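The decomposition into phase patterns can be checked by brute force on a tiny example. The sketch below enumerates every linear function in the set of phase patterns for a handful of ReLUs and evaluates each at a fixed point. The function name `phase_pattern_values` and the `"active"`/`"inactive"` labels are assumptions for this illustration, and evaluating at a single point only illustrates the pointwise identity, not the LP-based minimization that a convex solver would perform.

```python
from itertools import product
from typing import Dict, List, Tuple


def phase_pattern_values(
        assignment: List[Tuple[float, float]]) -> Dict[Tuple[str, ...], float]:
    """Evaluate each of the 2^k linear phase patterns at the given (n^b, n^a) pairs."""
    values: Dict[Tuple[str, ...], float] = {}
    for pattern in product(("active", "inactive"), repeat=len(assignment)):
        total = 0.0
        for phase, (nb, na) in zip(pattern, assignment):
            # Active phase contributes n^a - n^b; inactive phase contributes n^a.
            total += (na - nb) if phase == "active" else na
        values[pattern] = total
    return values
```

At the point `[(1.0, 1.5), (-1.0, 0.25)]` the four phase patterns evaluate to 1.75, 0.75, 2.75, and 1.75, and their minimum, 0.75, coincides with the sum of the per-ReLU minima, as the identity above states.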
4.2 Stochastically Minimizing the SoI with MCMC Sampling


In the worst case, finding the minimal value of $cost(f, \tilde{\phi})$ requires enumerating
and minimizing each $f$ in $S_{soi}$ (or equivalently, minimizing $f_{soi}$ for each activation
pattern), which has size $2^{|R(N)|}$. However, importantly, the search can terminate
immediately if a phase pattern $f$ is found such that $cost(f, \tilde{\phi}) = 0$. We leverage
this fact below. Note that each phase pattern has $|R(N)|$ adjacent phase patterns,
each differing in only one linear subexpression. The space of phase patterns is
thus fairly dense, making it amenable to traversal using stochastic local search
methods. In particular, intelligent hill-climbing algorithms, which can be made
robust against local optima, are well suited for this task.

Markov chain Monte Carlo (MCMC) [11] methods are such an approach.
In our context, MCMC methods can be used to generate a sequence of phase
patterns $f_0, f_1, f_2, \ldots \in S_{soi}$, with the desirable property that in the limit, the
phase patterns are more frequently from the minimum region of $cost(f, \tilde{\phi})$.

We use the Metropolis-Hastings (M-H) algorithm [16], a widely applicable
MCMC method, to construct the sequence. The algorithm maintains a current
phase pattern $f$ and proposes to replace $f$ with a new phase pattern $f'$. The
proposal comes from a proposal distribution $q(f' \mid f)$ and is accepted with a certain
acceptance probability $m(f \to f')$. If the proposal is accepted, $f'$ becomes the new
current phase pattern. Otherwise, another proposal is considered. This process is
repeated until one of the following scenarios happens: 1) a phase pattern $f$ is chosen
with $cost(f, \tilde{\phi}) = 0$; 2) a predetermined computational budget is exhausted; or 3)
all possible phase patterns have been considered. The last scenario is generally
infeasible for non-trivial networks. In order to employ the algorithm, we transform
$cost(f, \tilde{\phi})$ into a probability distribution $p(f)$ using a common method [34]:

$$ p(f) \propto \exp(-\beta \cdot cost(f, \tilde{\phi})) $$


where $\beta$ is a configurable parameter. If the proposal distribution is symmetric
(i.e., $q(f \mid f') = q(f' \mid f)$), the acceptance probability is the following (often referred
to as the Metropolis ratio) [34]:

$$ m(f \to f') = \min\left(1, \frac{p(f')}{p(f)}\right) = \min\left(1, \exp\left(-\beta \cdot \left(cost(f', \tilde{\phi}) - cost(f, \tilde{\phi})\right)\right)\right) $$

Importantly, under this acceptance probability, a proposal reducing the value of
the cost function is always accepted, while a proposal that does not may still be
accepted (albeit with a probability that is inversely correlated with the increase
in the cost). This means that the algorithm always greedily moves to a lower cost
phase pattern whenever it can, but it also has an effective means for escaping
local minima. Note that since the sample space is finite, as long as the proposal
strategy is ergodic,⁴ in the limit, the probability of sampling every phase pattern
(therefore deciding the satisfiability of $\phi$) converges to 1. However, we do not
have formal guarantees about the convergence rate, and it is usually impractical
to prove unsatisfiability this way. Instead, as we shall see in the next section, we
enable complete verification by embedding the M-H algorithm in an exhaustive
search shell.


4 A proposal strategy is ergodic if it is capable of transforming any phase pattern
to any other through a sequence of applications. We use a symmetric and ergodic
proposal distribution as explained in Section 5.1.
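The Metropolis ratio for a symmetric proposal distribution can be written down directly. The following sketch is a minimal illustration in which the function names `acceptance_probability` and `accept` are assumptions of this example. The early return for non-increasing cost is added here only to avoid floating-point overflow in `math.exp`; it returns the same value the formula gives.

```python
import math
import random


def acceptance_probability(cost_current: float, cost_proposed: float,
                           beta: float) -> float:
    """min(1, exp(-beta * (cost(f') - cost(f)))) for a symmetric proposal."""
    if cost_proposed <= cost_current:
        return 1.0  # proposals that do not increase the cost are always accepted
    return math.exp(-beta * (cost_proposed - cost_current))


def accept(cost_current: float, cost_proposed: float, beta: float,
           rng: random.Random) -> bool:
    """Draw a uniform sample in [0, 1) and accept with the Metropolis probability."""
    return rng.random() < acceptance_probability(cost_current, cost_proposed, beta)
```

With $\beta = 10$, a proposal that raises the cost by 0.1 is accepted with probability $e^{-1} \approx 0.37$, while one that raises it by 1 is accepted with probability $e^{-10}$, so larger values of $\beta$ make the search greedier.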


5 The DeepSoI Algorithm


In this section, we introduce DeepSoI, a novel verification algorithm that leverages
the SoI function, and show how to integrate it with a complete verification
procedure. We also discuss the impact of DeepSoI on complete verification and
propose an SoI-aware branching heuristic.


5.1 DeepSoI


Our procedure DeepSoI, shown in Algorithm 2, takes an input verification
query $\phi$ and tries to determine its satisfiability. DeepSoI follows the standard
two-phase convex optimization approach. Phase I finds some assignment $\alpha_0$
satisfying $\tilde{\phi}$, and phase II attempts to optimize the assignment using the M-H
algorithm. Phase II uses a convex optimization procedure
OPTIMIZECONV which takes an objective function $f$ and a formula $\tilde{\phi}$ as inputs
and returns a pair $\alpha, c$, where $\alpha \models \tilde{\phi}$ and $c = cost(f, \tilde{\phi})$ is the optimal
value of $f$. Phase II chooses an initial phase pattern $f$ based on $\alpha_0$ (Line 6) and
computes its optimal value $c$. The M-H algorithm repeatedly proposes a new
phase pattern $f'$ (Line 9), computes its optimal value $c'$, and decides whether to
accept $f'$ as the current phase pattern $f$. The procedure returns SAT when a phase
pattern $f$ is found such that $cost(f, \tilde{\phi}) = 0$ and UNSAT if all phase patterns have
been considered (EXHAUSTED returns true) before a threshold of $T$ rejections is
exceeded. Otherwise, the analysis is inconclusive (UNKNOWN).

Algorithm 2 Analyzing $\phi$ with DeepSoI.
 1: Input: A verification query $\phi$.
 2: Output: SAT/UNSAT/UNKNOWN
 3: function DEEPSOI($\phi$)
 4:     $r, \alpha_0 \leftarrow$ SOLVECONV($\tilde{\phi}$)                                  (Phase I)
 5:     if $r = \text{UNSAT} \lor \alpha_0 \models \phi$ then return $r, \alpha_0$
 6:     $k, f \leftarrow 0,$ INITPHASEPATTERN($\alpha_0$)                                   (Phase II)
 7:     $\alpha, c \leftarrow$ OPTIMIZECONV($f, \tilde{\phi}$)
 8:     while $c > 0 \land \lnot$EXHAUSTED() $\land\ k < T$ do
 9:         $f' \leftarrow$ PROPOSE($f$)
10:         $\alpha', c' \leftarrow$ OPTIMIZECONV($f', \tilde{\phi}$)
11:         if ACCEPT($c, c'$) then $f, c, \alpha \leftarrow f', c', \alpha'$
12:         else $k \leftarrow k + 1$
13:     if $c = 0$ then return SAT, $\alpha$
14:     else return EXHAUSTED() ? UNSAT : UNKNOWN

The ACCEPT method decides whether a proposal is accepted based on the
Metropolis ratio (see Section 4). Function INITPHASEPATTERN proposes the
initial phase pattern $f$ induced by the activation pattern corresponding to
assignment $\alpha_0$. Our proposal strategy (PROPOSE) is also simple: pick a ReLU $n$
at random and flip its cost component in the current phase pattern $f$ (either
from $n^a - n^b$ to $n^a$, or vice-versa). This proposal strategy is symmetric, ergodic,
and performs well in practice. Both the initialization strategy and the proposal
strategy are crucial to the performance of the M-H Algorithm, and exploring more
sophisticated strategies is a promising avenue for future work. Importantly, the
same convex procedure is used in both phases. Therefore, from the perspective
of the convex procedure, DeepSoI solves a sequence of convex optimization
problems that differ only in the objective functions, and each problem can be
solved incrementally by updating the phase pattern without the need for a restart.
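The control flow of Phase II can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Marabou implementation: the convex optimization call is replaced by a caller-supplied `cost` oracle that returns the optimal value for a phase pattern, a phase pattern is encoded as a tuple of booleans, the satisfying pattern is returned instead of an assignment, and the names `deep_soi_phase2`, `Pattern`, and `tol` are hypothetical.

```python
import math
import random
from typing import Callable, Optional, Tuple

Pattern = Tuple[bool, ...]  # True: cost term n^a - n^b (active); False: n^a (inactive)


def deep_soi_phase2(cost: Callable[[Pattern], float], initial: Pattern, T: int,
                    beta: float, rng: random.Random,
                    tol: float = 1e-9) -> Tuple[str, Optional[Pattern]]:
    """M-H search over phase patterns with a budget of T rejected proposals."""
    f, c = initial, cost(initial)
    visited = {f}
    total = 2 ** len(f)  # number of phase patterns
    k = 0                # rejected proposals so far
    while c > tol and len(visited) < total and k < T:
        # PROPOSE: flip the cost component of one ReLU chosen uniformly at random.
        i = rng.randrange(len(f))
        f_new = f[:i] + (not f[i],) + f[i + 1:]
        visited.add(f_new)
        c_new = cost(f_new)
        # ACCEPT: Metropolis ratio; non-increasing cost is always accepted.
        if c_new <= c or rng.random() < math.exp(-beta * (c_new - c)):
            f, c = f_new, c_new
        else:
            k += 1
    if c <= tol:
        return "SAT", f
    # All patterns were considered and none reached zero cost.
    return ("UNSAT" if len(visited) == total else "UNKNOWN"), None
```

In a real verifier each call to `cost` is an LP solve over the relaxation, so the number of proposals, rather than the bookkeeping shown here, dominates the running time.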


5.2 Complete Analysis and Pseudo-impact Branching 


To extend a canonical complete verification procedure (i.e., Algorithm 1), its
SOLVECONV call is replaced with DeepSoI. Note that the implementation of
BRANCH in this algorithm has a significant influence on its performance. Here,
we consider an SoI-aware implementation of BRANCH, which makes decisions by
selecting a particular ReLU to be active or inactive. The choice of which ReLU
to branch on is crucial. Intuitively, we want to branch on the ReLU with the most
impact on the value of $f_{soi}$. After branching, DeepSoI should be closer to either:
finding a satisfying assignment (if $f_{soi}$ is decreased), or determining
unsatisfiability (if $f_{soi}$ is increased). Computing the exact impact of each ReLU $n$
on $f_{soi}$ would be expensive; however, we can estimate it by recording changes in
$f_{soi}$ during the execution of DeepSoI.

Concretely, for each ReLU $n$, we maintain its pseudo-impact,⁵ $PI(n)$, which
represents the estimated impact of $n$ on $f_{soi}$. For each $n$, $PI(n)$ is initialized to
0. Then during the M-H algorithm, whenever the next proposal flips the cost
component of ReLU $n$, we calculate the local impact on $f_{soi}$:
$\Delta = |cost(f, \tilde{\phi}) - cost(f', \tilde{\phi})|$. We use $\Delta$ to update the value of $PI(n)$ according
to the exponential moving average (EMA): $PI(n) = \gamma \cdot PI(n) + (1 - \gamma) \cdot \Delta$, where
$\gamma$ attenuates previous estimates of $n$’s impact. We use EMA because recent
estimates are more likely to be relevant to the current phase pattern. At branching
time, the pseudo-impact heuristic picks $\arg\max_n PI(n)$ as the ReLU to split on. The
heuristic is inaccurate early in the search, so we use a static branching order
(e.g., [71,13]) while the depth of the search tree is shallow (e.g., $< 3$).
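The bookkeeping behind the pseudo-impact heuristic is small enough to sketch directly. The class below is an illustration under assumptions: the class name `PseudoImpact`, its method names, and the use of string identifiers for ReLUs are hypothetical, and the static branching order used at shallow depths is omitted.

```python
from typing import Dict, Hashable, Iterable


class PseudoImpact:
    """Exponential moving average of the SoI change caused by flipping each ReLU."""

    def __init__(self, relus: Iterable[Hashable], gamma: float = 0.5) -> None:
        self.gamma = gamma
        self.pi: Dict[Hashable, float] = {n: 0.0 for n in relus}  # PI(n) starts at 0

    def update(self, n: Hashable, cost_before: float, cost_after: float) -> None:
        """Record a proposal that flipped the cost component of ReLU n."""
        delta = abs(cost_before - cost_after)
        self.pi[n] = self.gamma * self.pi[n] + (1.0 - self.gamma) * delta

    def pick(self) -> Hashable:
        """Return the ReLU with the largest pseudo-impact (argmax_n PI(n))."""
        return max(self.pi, key=self.pi.get)
```

With $\gamma = 0.5$, each update halves the weight of all earlier observations, so the estimate tracks the most recent proposals closely.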


6 Experimental Evaluation 


In this section, we present an experimental evaluation of the proposed techniques. 
Our experiments include: 1. an ablation study to examine the effect of each pro- 
posed technique; 2. a run-time comparison of our prototype with other complete 
analyzers; 3. an empirical study of the choice of the rejection threshold T in 
Algorithm 2; and 4. an experiment in using our analysis procedure to improve the 
perturbation bounds found by AutoAttack [17], an adversarial attack algorithm. 
An artifact with which the results can be replicated is available on Zenodo [72]. 


5 The name is in analogy to pseudo-cost branching heuristics in MILP, where the 
integer variable with the largest impact on the objective function is chosen [6]. 
6.1 Implementation.


We implemented our techniques in Marabou [36], an open-source toolbox for
analyzing neural networks. It features a user-friendly Python API for defining
properties and loading networks, and a native implementation of the Simplex
algorithm. Besides the Markov chain Monte Carlo stochastic local search algorithm
presented in Section 5.1 and the pseudo-impact branching heuristic presented in
Section 5.2, we also implemented a Walksat-inspired [54] stochastic local search
strategy to evaluate the effectiveness of MCMC-sampling as a local minimization
strategy. Concretely, from a phase pattern $f$, the strategy greedily moves to a
neighbor $f'$ of $f$, with $cost(f', \tilde{\phi}) < cost(f, \tilde{\phi})$. If no such $f'$ exists (i.e., a local
minimum has been reached), the strategy moves to a random neighbor.
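One step of such a Walksat-style strategy might look as follows. This is a sketch under assumptions: the text does not say which improving neighbor is taken, so choosing the lowest-cost one is a choice made for this example, and the name `walksat_step`, the boolean-tuple encoding of phase patterns, and the `cost` oracle (standing in for the convex optimization call) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Pattern = Tuple[bool, ...]


def walksat_step(f: Pattern, cost: Callable[[Pattern], float],
                 rng: random.Random) -> Pattern:
    """Move to an improving neighbor if one exists, else to a random neighbor."""
    # Neighbors differ from f in the cost component of exactly one ReLU.
    neighbors: List[Pattern] = [f[:i] + (not f[i],) + f[i + 1:]
                                for i in range(len(f))]
    current = cost(f)
    improving = [g for g in neighbors if cost(g) < current]
    if improving:
        # Assumption: take the improving neighbor with the lowest cost.
        return min(improving, key=cost)
    # Local minimum reached: escape with a uniformly random neighbor.
    return rng.choice(neighbors)
```

Unlike the Metropolis rule, this step never accepts a worsening move while an improving one exists, which is the behavioral difference the ablation study compares.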

The SOLVECONV and OPTIMIZECONV methods in Algorithm 2 can be instan- 
tiated with either the native Simplex engine of Marabou or with Gurobi, an 
off-the-shelf (MI)LP-solver. The TIGHTENBOUNDS method is instantiated with 
the DeepPoly analysis from [58], an effective and light-weight bound-tightening 
pass, which is also implemented in Marabou. 


6.2 Benchmarks. 


We evaluate on networks from four different applications: MNIST, CIFAR10,
TaxiNet, and GTSR. The network architectures are shown in Fig. 2.
The MNIST [42] and CIFAR10 [40] networks are established benchmarks used
in previous literature (e.g., [19,37,71,75]) as well as in the 2021 VNN
Competition [4]. Notably, the same MNIST networks are used to evaluate the original
PeregriNN work.

Bench.    Model         Type   ReLUs   Hid. Layers
MNIST     MNIST_1       FC     [?]     [?]
          MNIST_2       FC     [?]     [?]
          MNIST_3       FC     1536    6
TaxiNet   Taxi_1        Conv   688     6
          Taxi_2        Conv   2048    4
          Taxi_3        Conv   2752    6
CIFAR10   CIFAR10_[?]   Conv   1226    4
          CIFAR10_[?]   Conv   4804    4
          CIFAR10_[?]   Conv   5196    6
GTSR      GTSR_1        FC     600     3
          GTSR_2        Conv   2784    4

Fig. 2: Architecture overview.

For MNIST and CIFAR10 networks, we check robustness against targeted
$\ell_\infty$ attacks on randomly selected images from the test sets. The target labels are
chosen randomly from the incorrect labels, and the perturbation bound is sampled
uniformly from {0.01, 0.03, 0.06, 0.09, 0.12, 0.15}. The TaxiNet [33] benchmark
set comprises robustness queries over regression models used for vision-based
autonomous taxiing. Given an image of the taxiway captured by the aircraft, the
model predicts its displacement (in meters) from the center of the taxiway. A
controller uses the output to adjust the heading of the aircraft. Robustness is
parametrized by input perturbation $\delta$ and output perturbation $\epsilon$; we sample $(\delta, \epsilon)$
uniformly from {0.01, 0.03, 0.06} × {2, 6}.

Bench. (#)          MILP_MIPVerify    LP^snc           SoI^snc_mcmc     SoI^pi_mcmc      SoI^pi_wsat
                    Solv.  Time       Solv.  Time      Solv.  Time      Solv.  Time      Solv.  Time
MNIST_1 (90)        77     19791      47     6892      66     5635      70     5976      68     5388
MNIST_2 (90)        29     6125       24     514       36     4356      31     757       31     909
MNIST_3 (90)        23     957        21     1609      34     9519      35     8327      33     5270
Taxi_1 (90)         90     786        61     9054      80     4257      89     1390      90     1489
Taxi_2 (90)         40     17093      2      891       70     5503      71     6889      71     7407
Taxi_3 (90)         89     5058       64     69715     87     1034      88     2164      87     997
CIFAR10_[?] (90)    76     4316       26     7425      69     6286      73     16469     69     5200
CIFAR10_[?] (90)    38     9879       18     845       41     4619      42     8129      42     6415
CIFAR10_[?] (90)    30     4198       21     3395      51     17679     51     15056     51     15015
GTSR_1 (90)         90     2541       90     2435      89     4900      90     15238     90     4805
GTSR_2 (90)         90     23613      90     4456      90     7507      90     10426     90     6180
Total (990)         673    94354      463    107230    711    71294     730    90822     721    59073

Table 1: Instances solved by different configurations and their runtime (in seconds)
on solved instances.

The GTSR benchmark set comprises robustness queries on image classifiers
trained on a subset of the German Traffic Sign Recognition benchmark set [60].
Given a 32 × 32 RGB image the networks classify it as one of seven different
kinds of traffic signs. A hazing perturbation [50] drains color from the image to
create a veil of colored mist. Given an image $I$, a perturbation parameter $\epsilon$, and
a haze color $C^f$, the perturbed image $I'$ is equal to $(1 - \epsilon) \cdot I + \epsilon \cdot C^f$. The
robustness queries check whether the bound yielded by the test-based method
in [50] is minimal. All pixel values are normalized to [0, 1], and the chosen
perturbation values yield a mix of non-trivial SAT and UNSAT instances.
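The hazing perturbation is a per-pixel convex combination of the image and the haze color, which can be sketched as follows. The function name `haze` and the representation of an image as nested lists of RGB tuples with values in [0, 1] are assumptions made for this illustration.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]


def haze(image: Image, color: Pixel, eps: float) -> Image:
    """Return I' = (1 - eps) * I + eps * C^f, applied to every pixel channel."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    return [[tuple((1.0 - eps) * p + eps * c for p, c in zip(pixel, color))
             for pixel in row]
            for row in image]
```

Setting `eps` to 0 leaves the image unchanged and setting it to 1 replaces every pixel with the haze color, so the perturbation bound interpolates between the two.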


6.3 Experimental Setup. 


Experiments are run on a cluster equipped with Intel Xeon E5-2637 v4 CPUs 
running Ubuntu 16.04. Unless specified otherwise, each job is run with 1 thread, 
8GB memory, and a 1-hour CPU timeout. By default, the SOLVECONV and 
OPTIMIZECONV methods use Gurobi. The following hyper-parameters are used: 
the rejection threshold $T$ in Algorithm 2 is 2; the discount factor $\gamma$ in the EMA
is 0.5; and the probability density parameter $\beta$ in the Metropolis ratio is 10.
These parameters are empirically optimal on a subset of MNIST benchmarks. 
In practice, the performance is most sensitive to the rejection threshold T, and 
below (Section 6.6), we conduct experiments to study its effect. 


6.4 Ablation study of the proposed techniques. 


To evaluate each individual component of our proposed techniques, we run several 
configurations across the full set of benchmarks described above. 

We first consider two baselines that do not minimize the SoI: 1. $\mathrm{LP}^{snc}$: runs
Algorithm 1 with the Split-and-Conquer (SnC) branching heuristic [71], which
estimates the number of tightened bounds from a ReLU split; 2. $\mathrm{MILP}_{MIPVerify}$:
encodes the query in Gurobi using MIPVerify’s MILP encoding [63].⁶

We then evaluate three configurations of SoI-based complete analysis
parameterized by the branching heuristic and the SoI-minimization algorithm:
1. $\mathrm{SoI}^{snc}_{mcmc}$: runs DeepSoI with the SnC branching heuristic; 2. $\mathrm{SoI}^{pi}_{mcmc}$: runs
DeepSoI with the pseudo-impact (PI) heuristic; 3. $\mathrm{SoI}^{pi}_{wsat}$: runs the Walksat-based
algorithm with the PI heuristic. Each SoI configuration differs in one parameter
w.r.t. the previous, so that pair-wise comparison highlights the effect of that
parameter.


6 This configuration does not use the LP/MILP-based preprocessing passes from
MIPVerify [63] because they degrade performance on our benchmarks.


Bench. (#)          SoI^pi_mcmc      PeregriNN        ERAN_1           ERAN_2
                    Solv.  Time      Solv.  Time      Solv.  Time      Solv.  Time
MNIST_1 (90)        70     5976      64     11117     76     18679     75     19520
MNIST_2 (90)        31     757       31     2287      28     1910      28     3126
MNIST_3 (90)        35     8327      26     2344      24     1538      24     3292
Taxi_1 (90)         89     1390      -      -         90     1653      90     3262
Taxi_2 (90)         71     6889      -      -         40     16460     35     31778
Taxi_3 (90)         88     2164      -      -         88     1389      88     4581
CIFAR10_[?] (90)    73     16469     -      -         77     4604      77     14269
CIFAR10_[?] (90)    42     8129      -      -         41     14403     37     14453
CIFAR10_[?] (90)    51     15056     -      -         31     7587      26     5245
GTSR_1 (90)         90     15238     -      -         90     2023      90     32585
GTSR_2 (90)         90     10426     -      -         78     77829     75     81232
Total (990)         730    90822     -      -         663    148075    645    213343

Table 2: Instances solved by different complete verifiers and their runtime (in
seconds) on solved instances.
Table 1 summarizes the runtime performance of different configurations on
the four benchmark sets. The three configurations that minimize the SoI, namely
SOI^snc_mcmc, SOI^pi_mcmc, and SOI^pi_wsat, all solve significantly more instances
than the two baseline configurations. In particular, SOI^snc_mcmc solves 248
(53.4%) more instances than LP^snc. Since all configurations start with the same
variable bounds derived by the DeepPoly analysis, the performance gain is mainly
due to the use of SoI. Among the three SoI configurations, the one with both pi
and mcmc solves the most instances. In particular, it solves 8 more instances than
SOI^pi_wsat, suggesting that MCMC sampling is, overall, a better approach than the
Walksat-based strategy. On the other hand, SOI^pi_mcmc and SOI^snc_mcmc show
complementary behaviors. For instance, the latter solves 5 more instances on
MNIST_1, and the former solves 11 more on the Taxi benchmarks. This motivates a
portfolio configuration SOI_portfolio, which runs SOI^pi_mcmc and SOI^snc_mcmc in
parallel. This strategy is able to solve 742 instances overall with a 1-hour
wall-clock timeout, yielding a gain of at least 12 more solved instances compared
with any single-threaded configuration.
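
The portfolio idea can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function name run_portfolio and the two stand-in configurations are hypothetical, and each configuration is modelled as a callable that returns "SAT", "UNSAT", or "UNKNOWN". The sketch runs all configurations on the same query and reports the first definitive answer.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_portfolio(configs, query):
    """Run all configurations on `query` in parallel threads.

    `configs` maps a configuration name to a callable returning "SAT",
    "UNSAT", or "UNKNOWN". The first definitive answer wins.
    """
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in configs.items()}
        for done in as_completed(futures):
            answer = done.result()
            if answer in ("SAT", "UNSAT"):
                # A real portfolio would also stop the other workers here;
                # this sketch simply lets them run to completion.
                return futures[done], answer
    return None, "UNKNOWN"

# Stand-ins for two solver configurations (hypothetical behaviour).
def config_a(query):
    return "UNKNOWN"

def config_b(query):
    return "UNSAT"

print(run_portfolio({"A": config_a, "B": config_b}, query=None))
```

Here the call returns ("B", "UNSAT") regardless of which worker finishes first, because only one configuration gives a definitive answer. For CPU-bound solvers one would more likely use separate processes than threads; threads are used here only to keep the sketch short.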


6.5 Comparison with other complete analyzers. 


In this section, we compare our implementation with other complete analyzers.
We first compare with PeregriNN, which, as described in Section 2, introduces a
heuristic cost function to guide the search. We evaluate PeregriNN on the MNIST
networks, the same set of networks used in its original evaluation. We did not
run PeregriNN on the other benchmarks because it only supports the .nnet format,
which is designed for fully connected feed-forward ReLU networks.

In addition, we compare with ERAN, a state-of-the-art complete analyzer
based on abstract interpretation, on the full set of benchmarks. ERAN is often
used as a strong baseline in recent neural network verification literature and was
among the top performers in the VNN Competition 2021. We compare with
two ERAN configurations: 1. ERAN_1: ERAN using the DeepPoly analysis [58] for
abstract interpretation and Gurobi for solving; 2. ERAN_2: same as above except
using the k-ReLU analysis [56] for abstract interpretation. We choose to compare
with ERAN instead of other state-of-the-art neural network analyzers, e.g.,
alpha-beta-CROWN [76,68], OVAL [19], and fast-and-complete [75], mainly because the
latter tools are GPU-based, while ERAN supports execution on CPU, where our
prototype is designed to run. This makes a fair comparison possible. Note that
our goal in this section is not to claim superiority over all state-of-the-art solvers.
Rather, the goal is to provide assurance that our implementation is reasonable.
As explained earlier, our approach can be integrated into other complete search
shells with different search heuristics, and is orthogonal to techniques such as
GPU-acceleration, parallelization, and tighter convex relaxation (e.g., beyond
the Planet relaxation), which are all future development directions for Marabou.


Table 2 summarizes the runtime performance of different solvers. We include
again our best configuration, SOI^pi_mcmc, for ease of comparison. On the three
MNIST benchmark sets, PeregriNN either solves fewer instances than SOI^pi_mcmc
or takes longer to solve the same number of instances. We note that
PeregriNN’s heuristic objective function could be employed during the feasibility
check of DeepSoI (Line 4, Algorithm 2). Exploring this complementarity between
PeregriNN and our approach is left as future work.


Compared with ERAN_1 and ERAN_2, SOI^pi_mcmc also solves significantly more
instances overall, with a performance gain of at least 10.1% more solved
instances. Taking a closer look at the performance breakdown on individual
benchmarks, we observe complementary behaviors between SOI^pi_mcmc and
ERAN_1, with the latter solving more instances than SOI^pi_mcmc on 3 of the 11
benchmark sets. Figure 3 shows the cactus plot of configurations that run
on all benchmarks. ERAN_1 is able to solve more instances than all the other
configurations when the time limit is short, but is overtaken by the three
SoI-based configurations once the time limit exceeds 30s. One explanation for this
is that the SoI-enabled configurations spend more time probing at each search
state, and for easier instances, it might be more beneficial to branch eagerly.


[Figure 3. x-axis: time (s); y-axis: number of instances solved; series: MILP,
LP^snc, SOI^snc_mcmc, SOI^pi_wsat, SOI^pi_mcmc, ERAN_1, ERAN_2.]

Fig. 3: Cactus plot on all benchmarks.
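
As a reading aid for cactus plots, the following Python sketch shows how such a curve can be derived from per-instance runtimes and how two solvers can cross over as the time limit grows. The function names and the runtimes are made up for illustration and are not taken from the experiments.

```python
from bisect import bisect_right

def cactus_curve(runtimes):
    """Sort the runtimes of solved instances (None marks an unsolved one).

    Entry k-1 of the result is the smallest per-instance time limit under
    which k instances are solved, i.e. the k-th point of a cactus plot.
    """
    return sorted(t for t in runtimes if t is not None)

def solved_within(curve, limit):
    """Count the instances solved when each one gets `limit` seconds."""
    return bisect_right(curve, limit)

# Made-up runtimes (seconds) for two hypothetical solvers on five instances.
eager = cactus_curve([1.0, 2.0, 3.0, None, None])
probing = cactus_curve([5.0, 6.0, 7.0, 8.0, 40.0])

for limit in (4, 60):
    print(limit, solved_within(eager, limit), solved_within(probing, limit))
```

With a 4-second limit the eager solver is ahead (3 solved versus 0), while with a 60-second limit the probing solver overtakes it (5 versus 3), which is the kind of crossover described above.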


Finally, we compare the portfolio strategy SOI_portfolio described in the
previous subsection to ERAN_1 running 2 threads. The latter solves 10.3% fewer
instances (673 overall). Figure 4 shows a scatter plot of the runtime performance
of these two configurations. For unsatisfiable instances, most can be resolved
efficiently by both solvers, and each solver has a few unique solves. On the other
hand, SOI_portfolio is able to solve significantly more satisfiable benchmarks.

Fig. 4: Runtime of SOI_portfolio and ERAN_1 running with 2 threads.

Fig. 5: Improvements over the perturbation bounds found by AutoAttack.


6.6 Incremental Solving and the Rejection Threshold T 


The rejection threshold T in Algorithm 2 controls the number of rejected proposals 
allowed before returning UNKNOWN. An incremental solver is one that can accept 
a sequence of queries, accumulating and reusing relevant bounds derived by 
each query. To investigate the interplay of T and incrementality, we perform 
an experiment using the incremental simplex engine in Marabou while varying 
the value of T. We additionally control the branching order (by using a fixed 
topological order). We conduct the experiment on 180 MNIST_1 and 180 Taxi_1
benchmarks from the aforementioned distributions.
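
To make the role of a rejection threshold concrete, here is a minimal Python sketch of a stochastic search that gives up after a fixed number of rejected proposals. It is not Algorithm 2: the acceptance rule exp(-beta * delta), the parameter beta, the step cap max_steps, the choice to count rejections cumulatively, and all function names are assumptions made for illustration.

```python
import math
import random

def bounded_search(state, cost, propose, threshold, beta=10.0,
                   max_steps=10_000, rng=None):
    """Metropolis-style minimisation of `cost` with a rejection budget.

    Returns ("SAT", state) once the cost reaches 0, and ("UNKNOWN", state)
    after `threshold` rejected proposals (or `max_steps` iterations).
    """
    rng = rng or random.Random(0)
    current = cost(state)
    rejected = 0
    for _ in range(max_steps):
        if current == 0 or rejected >= threshold:
            break
        candidate = propose(state, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current
        # Improvements are always accepted; a worse candidate is accepted
        # with probability exp(-beta * delta) (an assumed acceptance rule).
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            state, current = candidate, candidate_cost
        else:
            rejected += 1
    return ("SAT" if current == 0 else "UNKNOWN"), state

# Toy problem: the cost is the number of 1-bits in a tuple.
def clear_first_one(bits, rng):
    i = bits.index(1)
    return bits[:i] + (0,) + bits[i + 1:]

def make_worse(bits, rng):
    return bits + (1,) * 100

print(bounded_search((1, 1, 0), sum, clear_first_one, threshold=3))
print(bounded_search((1,), sum, make_worse, threshold=3))
```

The first call reaches cost 0 and reports "SAT"; the second only ever sees much worse candidates, rejects three of them, and reports "UNKNOWN". A larger threshold lets the search continue for longer before giving up.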

Table 3 shows the number of solved instances, as well as the average time (in 
seconds) and number of search states on the 95 commonly solved UNSAT instances. 
As T increases, more satisfiable benchmarks are solved. 


Rejection threshold T       1       2       3       4       5       6
SAT Solv.                 192     199     196     204     203     207
UNSAT Solv.                91      90      90      89      90      89
Avg. time (common)      97.75   129.0    83.6   108.1   137.0   187.8
Avg. states (common)    12948   12712    6122    5586    6404    8948


Table 3: Effect of the rejection threshold. 


Increasing T can also result in improvement on unsatisfiable instances—either 
the average time decreases, or fewer search states are required to solve the same 
instance. We believe this improvement is due to the reuse of bounds derived 
during the execution of DeepSoI. This suggests that adding incrementality to the 
convex solver (like Gurobi) could be highly beneficial for verification applications. 
It also suggests that the bounds derived during the simplex execution cannot be 
subsumed by bound-tightening analyses such as DeepPoly. 
6.7 Improving the perturbation bounds found by AutoAttack


Our proposed techniques result in a significant performance gain on satisfiable
instances. It is natural to ask whether the satisfiable instances solvable by the
SoI-enabled analysis can also be easily handled by adversarial attack algorithms,
which, as mentioned in Section 2, focus solely on finding satisfying assignments.
In this section, we show that this is not the case by presenting an experiment
where we use our procedure in combination with AutoAttack [17], a state-of-the-art
adversarial attack algorithm, to find higher-quality adversarial examples.

Concretely, we first use AutoAttack to find an upper bound on the minimal
perturbation required for a successful $\ell_\infty$ attack. We then use our procedure to
search for smaller perturbation bounds, repeatedly decreasing the bound by 2% 
until either UNSAT is proven or a timeout (30 minutes) is reached. We use the 
adversarial label of the last successful attack found by AutoAttack as the target 
label. We do this for the first 40 correctly classified test images for the three 
MNIST architectures, which yields 120 instances. Figure 5 shows the improvement 
of the perturbation bounds. Reduction of the bound is obtained for 53.3% of the 
instances, with an average reduction of 26.3%, a median reduction of 22%, and a 
maximum reduction of 58%. This suggests that our procedure can help obtain a 
more precise robustness estimation. 
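
The bound-tightening loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the experimental script: tighten_bound and the query callable are hypothetical names, the 2% decrease is assumed to be relative to the current bound, and the verifier is replaced by a toy oracle whose true minimal perturbation is 0.5.

```python
def tighten_bound(initial_bound, query, step=0.02, max_rounds=1000):
    """Shrink a perturbation bound while the verifier still finds an attack.

    `query(eps)` is assumed to return "SAT" if an adversarial example exists
    within eps, "UNSAT" if none exists, and "TIMEOUT" otherwise.
    """
    best = initial_bound
    for _ in range(max_rounds):
        candidate = best * (1.0 - step)
        if query(candidate) != "SAT":
            # UNSAT proves no attack at `candidate`; TIMEOUT ends the search.
            break
        best = candidate
    return best

# Toy oracle standing in for the verifier (true minimal perturbation: 0.5).
def toy_query(eps):
    return "SAT" if eps >= 0.5 else "UNSAT"

bound = tighten_bound(1.0, toy_query)
print(bound, 1.0 - bound / 1.0)  # tightened bound and relative reduction
```

The returned bound is the smallest value on the 2% grid for which the oracle still answers "SAT", so it lies within one step of the true minimum.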


7 Conclusions and Future Work 


In this paper, we introduced a procedure, DeepSoI, for efficiently minimizing
the sum of infeasibilities in activation function constraints with respect to the
convex relaxation of a neural network verification query. We showed how DeepSoI
can be integrated into a complete verification procedure, and we introduced a
novel SoI-enabled branching heuristic. Extensive experimental results suggest
that our approach is a useful contribution towards scalable analysis of neural
networks. Our work also opens up multiple promising future directions, including:
1) improving the scalability of DeepSoI by using heuristically chosen subsets of
activation functions in the cost function instead of all unfixed activation functions;
2) leveraging parallelism by using GPU-friendly convex procedures or minimizing
the SoI in a distributed manner; 3) devising more sophisticated initialization and
proposal strategies for the Metropolis-Hastings algorithm; 4) understanding the
effects of the proposed branching heuristics on different types of benchmarks; 5)
investigating the use of DeepSoI as a stand-alone adversarial attack algorithm.
Abstract. We report our experience in the formal verification of the
reference implementation of the Beacon Chain. The Beacon Chain is
the backbone component of the new Proof-of-Stake Ethereum 2.0 network:
it is in charge of tracking information about the validators, their
stakes, and their attestations (votes), and of slashing validators that
are found to be dishonest (they lose some of their stakes). The Beacon
Chain is mission-critical and any bug in it could compromise the whole
network. The Beacon Chain reference implementation developed by the
Ethereum Foundation is written in Python, and provides a detailed
operational description of the state machine each Beacon Chain’s network
participant (node) must implement. We have formally specified and
verified the absence of runtime errors in (a large and critical part of) the
Beacon Chain reference implementation using the verification-friendly
language Dafny. During the course of this work, we have uncovered
several issues and proposed verified fixes. We have also synthesised functional
correctness specifications that enable us to provide guarantees beyond
runtime errors. Our software artefact with the code and proofs in Dafny
is available at https://github.com/ConsenSys/eth2.0-dafny.


1 Introduction 


The Ethereum network is gradually transitioning to a more secure, scalable and
energy efficient Proof-of-Stake (PoS) consensus protocol, known as Ethereum
2.0 and based on GasperFFG [2]. The Proof-of-Stake discipline ensures that
participants who propose (and vote for) blocks are chosen with a frequency that
is proportional to their stakes. Another major feature of Ethereum 2.0 is sharding,
which enables the main blockchain to split into a number of independent and
hopefully smaller and faster chains. The transition from the current Ethereum
1 to the final version of Ethereum 2.0 (Serenity) is planned over a number of
years and will be rolled out in a number of phases. The first phase, Phase 0, is
known as the Beacon Chain. It is the backbone component of Ethereum 2.0 as
it coordinates the whole network of stakers and shards.
The Beacon Chain. The Beacon Chain (and its underlying protocol) is in
charge of enforcing consensus on the state of the system among the nodes, called
validators, participating in the network. The set of validators is dynamic: new
validators can register by staking some ETH (Ethereum crypto-currency). Once
registered, validators are eligible to propose and vote for new
blocks (of transactions) to be appended to the blockchain. The Beacon Chain
shipped on December 1, 2020. At the time of writing (October 14, 2021), close
to 250,000 validators have staked 7,780,000 ETH ($30 Billion USD). Considering
the coordination role and the amount of assets managed by the Beacon
Chain, it is a mission-critical component of the Ethereum 2.0 ecosystem. The
Beacon Chain reference implementation developed by the Ethereum Foundation
is written in Python, and provides a detailed operational description of the state
machine each Beacon Chain’s network participant (node) must implement.


Our Contribution. Our contribution is many-fold: 


— We have formally specified and verified the absence of runtime errors in (a 
large and critical part of) the Beacon Chain reference implementation using 
the verification-friendly language Dafny. 

— During the course of this work, we have uncovered several issues, proposed 
verified fixes, some of which have been integrated in the reference
implementation, while others have resulted in substantial improvements (accuracy,
readability) of the reference implementation.

— We have also manually synthesised functional correctness specifications that 
enable us to provide guarantees beyond runtime errors. 

— Our software artefact with the code and proofs in Dafny is publicly available 
in our repository at https://github.com/ConsenSys/eth2.0-dafny.


Related Work. The Ethereum Foundation has supported several projects
related to applying formal methods for the analysis of the Beacon Chain (and
other components). A foundational project³ was undertaken in 2019 by Runtime
Verification Inc. and provided a formal and executable semantics, in the
K framework, of the reference implementation [1]. The semantics was validated
and the reference implementation could be tested, which resulted in a first set of
recommendations and fixes to the reference implementation. Although it may be
possible to formally verify the Beacon Chain with the K-framework tools, to the
best of our knowledge it has not been done yet. Runtime Verification Inc. have
also formally specified and verified (in Coq [11]) the underlying GasperFFG [2]
protocol. Our work complements these formal verification projects. Indeed, our
objective is to provide guarantees for the absence of bugs (runtime errors) and for
loop termination, which goes beyond testing. We have chosen to use a
verification-friendly programming language, Dafny [10], as it enables us to write
the code in a more developer-friendly manner (compared to K).


3 https://github.com/runtimeverification/beacon-chain-spec
2 The Beacon Chain Reference Implementation


In this section we introduce the system we want to formally verify, discuss the
potential benefits and impact of such a study, and set out the goals of our
experiment.


2.1 System Description and Scope of the Study 


As a robust decentralised system, the Beacon Chain aims to implement a repli- 
cated state machine [9] that is fault-tolerant to a fraction of unreliable par- 
ticipants (e.g., participants that can crash). The replicated state machine is 
implemented with a number of networked identical state machines running con- 
currently. This provides redundancy and a more reliable system. The state of 
each machine changes on an occurrence of an event. As the machines operate 
asynchronously, two different machines may receive different events that cannot 
be totally ordered time-wise. This is why before processing an event and chang- 
ing their states, the state machines run a consensus protocol to decide which 
event they should all process next. The consensus protocol aims to guarantee 
(under certain conditions) that an agreement will be reached which ensures that 
events are processed in the same order on each machine. 


2.2 The Beacon Chain Reference Implementation 


The Beacon Chain (Phase 0) reference implementation [6] describes the state 
machine that every Beacon node (participant) has to implement. The idea is 
that anyone is allowed to be a participant in the decentralised Ethereum 2.0 
ecosystem when it is fully deployed. However, as the consensus protocol is Proof- 
of-Stake there must be a mechanism for participants to register and stake, to 
slash a participant’s stake if they are caught⁴ misbehaving, i.e., not following the
consensus protocol, and to reward them if they are honest. The Beacon Chain 
provides these mechanisms. It maintains records about the participants, called 
validators, ensuring fairness (each honest participant should have a voting power, 
for new blocks, related to its stake), and safety (a dishonest participant may be 
slashed and lose part of their stakes). 

The full Beacon Chain (Phase 0) reference implementation [6] comprises three 
main sections: 


1. the Beacon Chain State Transition describing the Beacon state machine 
which is the most complex component; 

2. the Simple SerialiZe (SSZ) library for how to encode/decode
(serialise/deserialise) data that have to be communicated over the network;

3. the Merkleise library for how to build efficient encoding of data structures 
into Merkle trees, and how to use them to verify Merkle proofs. 


4 In a distributed system with potentially dishonest participants, it is not always
possible to detect who is dishonest (byzantine). However, a participant
can sometimes be proved to be dishonest.
The State Transition. The Beacon Chain state transition is the most
critical part. At the operational level, its complexity stems from the following:


— time is logically divided into epochs, and each epoch into a fixed number of 
slots; the state is updated at each slot; 

— at the beginning of each epoch, disjoint subsets of validators are assigned to 
each slot to participate in the block proposal for the slot and attest (vote) 
for links in the chain; 

— the state updates that apply at an epoch boundary are more complex than 
the other updates; 

— the actual state of the chain is a block-tree i.e., a tree of blocks, and the 
canonical chain is defined as a particular branch in this tree. How this branch 
is determined is defined by the fork choice rule. 

— the fork choice rule relies on properties of nodes, justification and finalisa- 
tion, in the block-tree. The state update describes how nodes in the block- 
tree are deemed justified/finalised. The rules for justification and finalisation 
are introduced in a separate document, the GasperFFG [2] protocol. 


SSZ and Merkleise. These libraries are self-contained and independent from 
the state transition. We used them as a feasibility study and we had verified 
them before this project started. We have provided a complete Dafny reference 
implementation for them in the merkle and ssz packages [3]. 


2.3 Motivation for Formal Verification 


As mentioned previously, the Beacon Chain shipped on December 1, 2020 and, to
date, 250,000 validators have staked 7,780,000 ETH ($30 Billion USD). It is
clear that any bug, or logical error, could have disastrous consequences resulting
in losses of assets for regular users, downtime and degradation of service, or
losses of rewards for the validators.

There are regular opportunities (forks) to update the code of Beacon Chain
nodes, so running projects like ours continuously is very valuable: what matters
is to find and fix bugs before attackers can exploit them. The operational
description of the Beacon Chain in the reference implementation is
provided in Python. It was written by several authors at the Ethereum
Foundation and, due to its size, it is hard for one person to
have a complete picture of it. It is the reference for any Beacon Chain client
implementer. As a result, inaccuracies, ambiguities, or bugs in the reference
implementation will lead to erroneous and/or buggy clients that can compromise
the integrity or the performance of the network. Moreover, the reference
implementation uses a defensive mechanism against unexpected errors:


(Rule 1) “State transitions that trigger an unhandled exception (e.g. a 
failed assert or an out-of-range list access) are considered invalid. State 
transitions that cause a uint64 overflow or underflow are also considered 
invalid.” [6] 
However, this creates a risk that errors unrelated to the logic of the state
transition function may introduce spurious exceptions. At the time of writing, there
are at least 4 different Ethereum 2.0 client implementations that are used by
validators. Bugs in the reference implementation may be handled differently in the
various clients, and in some cases lead to a split in the network⁵. The correctness
of the consensus mechanism is guaranteed for up to 1/3 of malicious nodes, that is,
nodes deviating from the reference implementation, be it intentionally or
unintentionally (e.g., because of a bug in the code). Hence, we should try to
reduce the number of unintentionally malicious (buggy) nodes.
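
The effect of (Rule 1) can be illustrated with a small Python sketch. It is not taken from the reference implementation or from any client: try_apply_block, toy_transition, and the dictionary-based state are hypothetical, and the explicit uint64 check stands in for the checked integer type that the rule refers to. The point is that any exception, whether caused by an ill-formed block or by an unrelated bug, leads to the same outcome: the block is rejected and the state is left unchanged.

```python
import copy

UINT64_MAX = 2**64 - 1

def try_apply_block(state, signed_block, state_transition):
    """Apply a mutating transition under the 'exception means invalid' rule.

    Returns (new_state, True) on success and (state, False), with the
    original state untouched, if the transition raises any exception.
    """
    candidate = copy.deepcopy(state)
    try:
        state_transition(candidate, signed_block)
    except Exception:
        return state, False
    return candidate, True

def toy_transition(state, block):
    # Genuine validity check on the block.
    assert block["slot"] > state["slot"]
    state["slot"] = block["slot"]
    # Intermediate computation that may overflow uint64.
    new_balance = state["balance"] + block["reward"]
    if new_balance > UINT64_MAX:
        raise OverflowError("uint64 overflow")
    state["balance"] = new_balance

genesis = {"slot": 0, "balance": 10}
print(try_apply_block(genesis, {"slot": 1, "reward": 5}, toy_transition))
print(try_apply_block(genesis, {"slot": 0, "reward": 5}, toy_transition))
print(try_apply_block(genesis, {"slot": 1, "reward": UINT64_MAX}, toy_transition))
```

The second and third calls are both rejected, yet only the second reflects a check that the block itself fails; from the outside, the wrapper cannot tell a failed assert from an overflow in an intermediate computation.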


2.4 Objectives of the Study 


Our goal is to improve the overall safety, readability and usability of the reference 
implementation. Testing is of course an option, and Beacon Chain clients all 
implement some form of testing. In this project we are interested in proving the 
absence of bugs which goes beyond what testing techniques can do: testing can 
show the presence of bugs but not their absence (Dijkstra, 1970). 

The primary aspect of our project was to make sure that the code was 
free of runtime errors (e.g., over/underflows, array-out-of-bounds, division-by- 
zero, ...). This provides more confidence that when an exception occurs and 
a state is left unchanged as per (Rule 1), the root cause is a genuine prob- 
lem related to the state transition having been given an ill-formed block: if 
state_transition(state,signed_block) triggers an exception, it should im- 
ply that there is a problem with the signed_block not that some intermediate 
computations resulted in runtime errors. A secondary goal was to try and synthe- 
sise functional specifications from the reference implementation. This can help 
developers to design tests, and contributes to the specifications being language- 
agnostic. For instance, it can help write a client in a functional language which 
results in a more inclusive ecosystem. 


3 Formal Specification and Verification 


In this section we present the challenges of the project, motivate our methodology
and conclude with a breakdown of our results.


3.1 Challenges 


The main challenges in this formal verification project are in the verification of 
the code of the state_transition component of the Beacon Chain. The SSZ 
and Merkleise libraries are much smaller, simpler, and independent components 
that can be dealt with separately. 

The reference implementation for the Beacon Chain [6] introduces data types
and algorithms that should be interpreted as Python 3 code. As a result it may
not be straightforward for those who are not familiar with Python to
understand the meaning of some parts of the code. More importantly, the reference
implementation is not executable and may contain type mismatches,
incompatible function signatures, and bugs that can result in runtime errors like
under/overflows or array-out-of-bounds.


5 A network split can be caused if some clients reject a chain that is being followed
by the other clients, which leads to a hard fork-like situation.


Listing A.1. The state transition function.


 1 def state_transition(
 2     state: BeaconState,
 3     signed_block: SignedBeaconBlock,
 4     validate_result: bool=True
 5 ) -> None:
 6     block = signed_block.message
 7     # Process slots (including those with no blocks) since block
 8     process_slots(state, block.slot)
 9     # Verify signature
10     if validate_result:
11         assert verify_block_signature(state, signed_block)
12     # Process block
13     process_block(state, block)
14     # Verify state root
15     if validate_result:
16         assert block.state_root == hash_tree_root(state)


A typical function in the reference implementation is written as a sequence
of control blocks (including function calls) intertwined with checks in the form of
assert statements. The state_transition function (Listing A.1) is the
component that computes the update of the Beacon Chain’s state. The state (of
type BeaconState) records some information including the validators’ stakes,
the subsets of validators (committees) allocated to a given slot, and the hashes⁶
of the blocks that have already been added to the chain. A state update is
triggered when a (signed) block is added to the Beacon Chain. The state machine
implicitly defined by the reference implementation generates sequences of states
of the form:

$$ s_0 \xrightarrow{b_0} s_1 \xrightarrow{b_1} s_2 \cdots \xrightarrow{b_n} s_{n+1} \cdots \qquad \text{(StateT)} $$

where $s_0$ is given (initial values), $b_0$ is the genesis block and for each
$i \geq 1$, $s_{i+1}$ = state_transition($s_i$, $b_i$).
There are several challenges in testing or verifying this kind of code: 


— the function calls (lines 8, 13) mutate the input variable state; those
functions also call other functions that mutate the state.

— the semantics is not fully captured by the Python 3 interpretation because
of the defensive mechanism (Rule 1, Section 2.3).

— a valid state transition is the opposite of an invalid state transition
(characterised by (Rule 1)). Determining when a computation is not going to trigger
runtime errors or failed asserts is non-trivial. This is due to the use of
mutating functions that can contain assert statements on values that are the
results of intermediate computations.


6 The actual blocks are recorded in the Store which is a separate data structure. 
— overall the code in the reference implementation does not explicitly define
what properties signed_block should satisfy to guarantee that executing the
function state_transition(state,signed_block) is not going to trigger
an exception. The implicit semantics of the code is: if an exception occurs
in executing state_transition with input signed_block, then this block
must be invalid (assuming state is always valid).
It follows that, if the code contains a bug that triggers a runtime error
unrelated to signed_block (e.g., an intermediate computation that overflows,
or an array-out-of-bounds in a sorting algorithm), signed_block is declared
invalid and not added to the chain. To alleviate this problem, we have
collected the conditions (predicates) under which the addition of a block should
not fail, which clearly defines when a block is valid.

— as there is no reference functional specification, it is not straightforward to
understand when a block is invalid, or to write (unit) tests.

— finally, the correctness of parts of the code relies on hidden assumptions,
e.g., the total amount of ETH is X so no overflow should happen.


The challenges pertaining to the SSZ and Merkleise libraries are more
manageable. First, the reference implementation is shorter. Second, even if there is
no functional specification available, it is reasonably easy to synthesise one. Due
to the previous weaknesses, the reference implementation [6] has been the subject
of several informal explainers [15,5,6].


3.2 Methodologies 


Resource Constraints. Resource-wise, the timeframe for our project was ap- 
proximately 8 months (October 2020 to June 2021), with a team of two formal 
verification researchers (first two co-authors) and one Beacon Chain expert re- 
searcher (third co-author). 


Verification Technique. The reference implementation is not the operational
description of a distributed system, but rather of a sequential state machine,
as per (StateT), Section 3.1. Thus, the techniques and tools that are adequate for
the goals we set are those of formal program verification.

There are several techniques to approach program verification, ranging from 
fully automated (e.g., static analysis/abstract interpretation [4], software model- 
checking [8]) to interactive theorem proving [13]. Most static analysers are un- 
sound (they cannot prove the absence of bugs) which disqualifies them for our 
project. It is anticipated that fully automated verification techniques can be ef- 
fective to detect runtime errors but may have limited applicability to proving 
functional correctness. 

On the other side of the spectrum, interactive theorem provers offer a
complete arsenal of logics/rules that can certainly be used for this kind of project.
However, they usually require encoding the software to be verified in a high-level
mathematical language that is rather different to a language like Python. The
level of expertise/experience required to properly use these tools is also high.
Overall this seemed incompatible with our available resources.
A middle ground between fully automated and interactive techniques is
deductive verification, available in verification-friendly programming languages like
Dafny [10], Why3 [7], Viper [12] or Whiley [14]. Deductive verification lets
verification engineers propose proofs and check them fully automatically.

We opted for Dafny [10], an award-winning verification-friendly language.
Dafny is actively maintained⁷ and under continuous improvement. It offers
imperative/object-oriented and functional programming styles. Moreover, some of
us had previous exposure to Dafny (working on the SSZ/Merkleise libraries
early in 2020), so we could be fully operational quickly, and it was compatible
with our resources. We are convinced that similar results could be achieved with
Why3, Viper or Whiley but did not have the resources to launch concurrent
experiments.


Verification Strategy. Our strategy to write the Beacon Chain reference im- 
plementation in Dafny and detect/fix runtime errors, and prove some functional 
properties is three-fold: 


1. Identify simplifications. The reference implementation is complex and 
trying to encode it fully in Dafny may result in inessential details hindering 
our verification progress. One example is the different data types (classes) 
for Attestations. There are several variations of the type Attestations 
and functions to convert between them. For our verification purposes, using 
PendingAttestations instead of the fully fledged Attestations was ade- 
quate. Another example is the abstraction of hashing functions. We assumed 
an uninterpreted collision-free hash function as we did not aim to prove any 
probabilistic properties involving this function. 

2. Translate the reference implementation into Dafny. This helped the
formal verification researchers to familiarise themselves with the reference
implementation. During this phase, we focussed on adding pre- and postconditions
to the functions of the reference implementation to guarantee the
absence of runtime errors. We were also able to prove some interesting
invariants: the data structure that contains the block-tree is indeed a well-formed
tree. This structure is implemented with links from nodes to their parent
(where null is a possible parent in the code). The invariant states that the
block-tree that is built with the state_transition function satisfies: i) the
set of ancestors of any block contains blocks with strictly smaller slot number
and is finite (no cycles); ii) the set of ancestors of any block in the block-tree
always contains the genesis block (with slot 0).

3. Synthesise functional specifications. In the last phase, we manually
synthesised functional specifications for each function in the reference
implementation. We proved that each function in the reference implementation
satisfied its functional specification. This enabled us to prove more complex
properties as we could do the formal reasoning and proofs on the functional
specifications and the results would carry over to the reference
implementation. This was an effective solution to be able to prove properties of the
reference implementation with lots of mutations (side-effects) without having
to embed them deep in the proofs.


7 https://github.com/dafny-lang/dafny


3.3 Results 


The complete code base is freely available in [3]. There are several resources 
apart from the verified code: a Docker container to batch verify the code, and 
some notes/videos to help navigate the Dafny specifications. 


Coverage. We estimated that we have verified 85% of the reference imple- 
mentation. The remaining 15% are simplifications e.g., data types, or using a 
fixed set of validators instead of a dynamic set. Adding the remaining details 
to the released version would require a substantial amount of work and at the 
same time it seems that the likelihood of finding new issues is low. Since the 
Beacon Chain shipped on December 1, 2020, only a few minor issues have 
been uncovered and promptly fixed, which seems to confirm the previous claim. 


Absence of Runtime Errors. All of the functions we have implemented 
in Dafny are annotated with pre (requires) and post (ensures) conditions 
that are verified, including loop termination. The Dafny version of function 
state_transition is given in Listing A.2. Other functions are written simi- 
larly e.g., process_slots and process_block. The Dafny verifier enforces the 
absence of runtime errors like division by zero, under/overflows, array-out-of- 
bounds. It follows that our code base is provably free of this kind of defect. 
Moreover, additional checks can be added like the assert statement at line 29. 
We have added all the assert statements from the reference implementation 
and proved that they could not be violated. This requires adding suitable pre- 
conditions. 

Regarding loop termination proofs, most of the proofs are based on relatively 
simple ranking functions. An example of a non-trivial termination proof can be 
found in a functional correctness proof: the ancestors of a given block form a 
strictly decreasing sequence, slot-wise, and consequently end up in the genesis 
block. The corresponding code is in the ForkChoice.dfy file. 


Functional Correctness. Beyond the absence of runtime errors, we have 
synthesised functional specifications based on the reference implementation code. 
For instance, we have decomposed the state update in state_transition into 
a sequence of simpler steps, updateBlock, forwardStateToSlot, nextSlot and 
proved that the result is a composition of these functions. This provides more 
confidence that the code is functionally correct as our decomposition specifies 
smaller changes in the state. It also enables us to prove properties on the func- 
tional specifications and transfer them to the imperative version of the code. 
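
The relationship between the imperative code and its functional specification can be illustrated with a small Python sketch. Everything in it is a toy assumption: the State record holds only a slot and the slots of the applied blocks, and the helper names echo those in the text but are not the Dafny definitions. The final check is ordinary testing on sample inputs, whereas in Dafny the same equality is proved for all inputs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Toy stand-in for BeaconState (hypothetical, for illustration only)."""
    slot: int
    applied: tuple = ()


# Functional specification: pure functions that return new states.
def next_slot(s: State) -> State:
    return replace(s, slot=s.slot + 1)


def forward_state_to_slot(s: State, slot: int) -> State:
    while s.slot < slot:
        s = next_slot(s)
    return s


def update_block(s: State, block_slot: int) -> State:
    return replace(s, applied=s.applied + (block_slot,))


def state_transition_spec(s: State, block_slot: int) -> State:
    # Composition of the simpler steps, as in the post-condition of Listing A.2.
    return update_block(forward_state_to_slot(next_slot(s), block_slot), block_slot)


# Imperative version: mutates a dictionary in place, like the reference code.
def state_transition_imperative(state: dict, block_slot: int) -> None:
    assert state["slot"] < block_slot  # toy analogue of the isValidBlock pre-condition
    while state["slot"] < block_slot:
        state["slot"] += 1
    state["applied"].append(block_slot)


def agrees(slot: int, block_slot: int) -> bool:
    """Check that both versions produce the same result on one input."""
    mutable = {"slot": slot, "applied": []}
    state_transition_imperative(mutable, block_slot)
    spec = state_transition_spec(State(slot), block_slot)
    return (mutable["slot"], tuple(mutable["applied"])) == (spec.slot, spec.applied)
```

Once the two versions are known to agree, properties such as "the resulting slot equals the block slot" only need to be established for the pure functions.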


Impact of our Project. During the course of this project we have reported 
several issues in the reference implementation, some of them bugs (3) and some 
of them requests for clarification (5). The issues we have uncovered are tracked in 
the issue tracker of our GitHub repository. Some of the bugs we reported have 
been fixed and our clarification requests have led to several improvements in 
the writing of the reference implementation. Moreover, we have provided a fully 
documented version of the reference implementation in Dafny. The Dafny code 
contains clear pre and post conditions that can help developers understand the 
effect of a function and can be used to write unit tests. 


Listing A.2. Dafny version of state_transition 


1  method state_transition(s: BeaconState, b: BeaconBlock) 
2    returns (s': BeaconState) 
3    // A valid state to start from 
4    requires |s.validators| == |s.balances| 
5    requires is_valid_state_epoch_attestations(s) 
6    // b must be a block compatible with s 
7    requires isValidBlock(s, b) 
8    // Functional correctness 
9    ensures s' == 
10     updateBlock(forwardStateToSlot(nextSlot(s), b.slot), b) 
11   // Other post-conditions 
12 
13   ensures s'.slot == b.slot 
14   ensures s'.latest_block_header.parent_root == 
15     hash_tree_root( 
16       forwardStateToSlot(nextSlot(s), b.slot) 
17         .latest_block_header 
18     ) 
19   ensures |s'.validators| == |s'.balances| 
20 
21 { 
22   // Finalise slots before b.slot. 
23   s' := process_slots(s, b.slot); 
24 
25   // Process block and compute the new state. 
26   s' := process_block(s', b); 
27 
28   // Verify state root (from eth2.0 specs) 
29   assert (b.state_root == hash_tree_root(s')); 
30 } 


Statistics. Table 1, page 11, provides some insights into the actual code, 
per file. We have tried to keep the size of each file small and provide optimal 
modularity in the proofs. The files in the packages fall into one of the three 
categories: file.dfy is the Python-reference implementation translated into 
Dafny; file.s.dfy contains the functional specifications we have synthesised 
and file.p.dfy any additional proofs (Lemmas) that are used in the correct- 
ness proofs. It is hard to estimate the lines of code to lines of proofs ratio for 
many reasons: i) it is not always possible to locate all the proofs in a separate 
unit (e.g. a module in Dafny), as this can create circular dependencies. 

It follows that counting lines of proofs as lines in the Lemmas is not an 
accurate measure; ii) in some of the proofs, we have, on purpose, provided re- 
dundant hints. As a result some proofs can be shortened but this may be at the 
expense of readability (and verification time). For this project, a conservative 
(and empirical) lines of code to lines of proofs ratio seems to be around 1 to 7. 

Table 1. Statistics. A file providing functional specifications. A file providing proofs 
(lemmas in Dafny). #LoC (resp. #Doc) is the number of lines of code (resp. 
documentation), Lem. the number of proper lemmas, Imp. the number of proved 
imperative functions with pre/post conditions. 


Files                  Package                 #LoC  Lem.  Imp.  #Doc  #Doc/#LoC (%)  Proved 
BeaconChainTypes.dfy   beacon                    54     0     0   171            317       0 
Helpers.dfy            beacon                  1003     9    89   670             67      98 
AttestationsTypes.dfy  beacon/attestations       30     0     0    68            227       0 
ForkChoice.dfy         beacon/forkchoice        229     3    15   172             75      18 
ForkChoiceTypes.dfy    beacon/forkchoice          9     0     0    17            189       0 
Crypto.dfy             beacon/helpers             7     0     1     3             43       1 
EpochProcessing.dfy    beacon/statetransition   384     0    14   127             33      14 
ProcessOperations.dfy  beacon/statetransition   361 
StateTransition.dfy    beacon/statetransition   215 
Validators.dfy         beacon/validators         11     0     0    53            482       0 
Merkleise.dfy          merkle                   504     9    18   135             27      27 
BitListSeDes.dfy       SSZ                      262     7     3    64             24      10 
BitVectorSeDes.dfy     SSZ                      155     4     3    53             34 
BoolSeDes.dfy          SSZ                       22     0     2     3             14       2 
BytesAndBits.dfy       SSZ                       90     7     6    44             49      13 
Constants.dfy          SSZ                      104     0     0    36             35       0 
IntSeDes.dfy           SSZ                      130     2     2    20             15       4 
Serialise.dfy          SSZ                      514     3     5    36              7       8 
DafTests.dfy           utils                     62     0     4    25             40       4 
Eth2Types.dfy          utils                    227     1     3    77             34       4 
Helpers.dfy            utils                    220    11     3   103             47      14 
MathHelpers.dfy        utils                    293    18     6   105             36      24 
NativeTypes.dfy        utils                     28     0     0    13             46 
NonNativeTypes.dfy     utils                      8     0     0     6             75       0 
SeqHelpers.dfy         utils                     69     8     2    58             84      10 
SetHelpers.dfy         utils                     74     6     0    50             68       6 

TOTAL                                          6570   170   208  3212             49     378 


4 Findings and Lessons Learned 


During the course of our formal verification effort we found subtle bugs and also 
proposed some clarifications for the reference implementation. In addition, our 
work was the opportunity to start some discussions about how to improve the 
readability of the reference implementation, e.g., by using pre and post conditions 
rather than assert statements. In this section we provide more insights into 
some of the main issues we reported⁸, and also on the practicality of this kind 
of project. 


4.1 Array-out-of-bounds Runtime Error 


The function get_attesting_indices (Listing A.3) is called from within several 
important components of the state_transition function including the process- 
ing of rewards and penalties, justification and finalisation, as well as the pro- 
cessing of attestations (votes). 


Listing A.3. Python code for get_attesting_indices. 


1  def get_attesting_indices( 
2      state: BeaconState, 
3      data: AttestationData, 
4      bits: Bitlist[MAX1] 
5  ) -> Set[ValidatorIndex]: 
6      """ 
7      Return the set of attesting indices corresponding to 
8      ``data`` and ``bits``. 
9      """ 
10     committee = get_beacon_committee(state, data.slot, data.index) 
11     return ( 
12         # Collect indices in committee for which bits is set 
13         set(index for i, index in enumerate(committee) if bits[i])) 

The last line (13) of get_attesting_indices collects the indices in the array 
committee that have a corresponding bit set to true in the array bits and 
returns them as a set of indices. The length of bits, noted |bits|, is MAX1. 
Consequently, the following relation must be satisfied to avoid an 
array-out-of-bounds error: |committee| ≤ MAX1. It follows that to prove⁹ the 
absence of array-out-of-bounds errors in Dafny, the specification of 
get_attesting_indices (in Dafny) requires a pre-condition, 
|get_beacon_committee(...)| ≤ MAX1 (line 10). This pre-condition naturally 
imposes a post-condition for get_beacon_committee and, trying to prove this 
post-condition, we uncovered a very subtle bug that depends on the number of 
active validators V in state: 


V ≤ 4,194,304: there is no array-out-of-bounds error as we can prove that 
|get_beacon_committee(...)| ≤ MAX1 for all values of the input parameters 
data.slot and data.index, 

4,194,304 < V < 4,196,352: there is at least one value of the input parameters 
data.slot and data.index for which |get_beacon_committee(...)| > MAX1, which 
results in an array-out-of-bounds error, and 

4,196,352 ≤ V: for all combinations of the inputs data.slot and data.index, 
there is an array-out-of-bounds error, as |get_beacon_committee(...)| > MAX1. 


8 https://github.com/ConsenSys/eth2.0-dafny/issues 
9 In Dafny, this check is built-in so you cannot avoid this proof. 


This previously undocumented bug was difficult to detect. It required many 
hours of effort to model the dynamics of the problem; the analysis was quite 
complex due to the multiple interrelated parameter calculations, as well as the 
use of floored integer division. The full description and the analysis of this bug 
have been reported as an issue¹⁰ to the reference implementation GitHub repository. 
The issue was confirmed by the reference implementation writers. 
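
The three regimes can be reproduced with a short calculation. The sketch below assumes the Phase 0 mainnet constants (32 slots per epoch, at most 64 committees per slot, a target committee size of 128, and MAX1 = 2048) and a simplified model in which the V active validators are split into committees by floored integer division. The constants, the model and the helper names are assumptions made for illustration; they are not taken from the Dafny code base.

```python
# Assumed Phase 0 mainnet constants (not stated in the text above).
SLOTS_PER_EPOCH = 32
MAX_COMMITTEES_PER_SLOT = 64
TARGET_COMMITTEE_SIZE = 128
MAX1 = 2048  # maximum number of validators per committee, i.e. |bits|


def committee_sizes(v: int) -> list:
    """Sizes of all committees in one epoch for v active validators."""
    per_slot = max(1, min(MAX_COMMITTEES_PER_SLOT,
                          v // SLOTS_PER_EPOCH // TARGET_COMMITTEE_SIZE))
    count = per_slot * SLOTS_PER_EPOCH
    # Committee i covers the index range [v*i // count, v*(i+1) // count).
    return [v * (i + 1) // count - v * i // count for i in range(count)]


def classify(v: int) -> str:
    """Tell whether bits[i] can be indexed out of bounds for v validators."""
    sizes = committee_sizes(v)
    if max(sizes) <= MAX1:
        return "never"      # every committee fits in bits
    if min(sizes) > MAX1:
        return "always"     # every committee overflows bits
    return "sometimes"      # only some (slot, index) pairs overflow
```

Under these assumptions, classify(4_194_304) is "never", classify(4_194_305) is "sometimes" and classify(4_196_352) is "always". The thresholds are 2048 × 2048 and 2049 × 2048, and the floored divisions are what make the middle regime depend on the particular slot and index.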


4.2 Beyond Runtime Errors 


We have also been able to establish some well-formedness properties of the data 
structure that represents the block-tree built by each node. Each added block 
has a stamp, the slot number and a link to its parent. The block-tree is the tree 
representation of the parent relation. The block-tree should satisfy the following 
properties: 


— Every block b except the genesis block has a parent, 

— Every block b with parent p is such that the slot of b is strictly larger than 
the slot of p, 

— the transitive closure of the parent relation produces chains of blocks that 
are totally ordered using the < relation on slot, 

— the smallest element of each chain has slot 0 (and consequently is the genesis 
block). 


We have established these properties in ForkChoice.dfy using a list of invariants 
on the Store. 

Another noticeable contribution compared to other approaches (like testing) 
is that we have proved the termination of all loops. For the majority of the 
loops, the ranking function used to prove termination is rather straightforward. 
An example of a more complicated (decreasing) ranking function can be found 
in the proof of a (functional correctness) lemma in ForkChoice.dfy: the proof 
relies on the slot number of a block’s parent being strictly smaller than the slot 
number of a block itself. The lemma establishes that the graph defined by the 
parent relation on the blocks in the store is always well-formed and is a 
(block-)tree: the list of ancestors of any block in the store is ordered (slot-wise) and the 
smallest element is the genesis block. 
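
The role of the slot number as a ranking function can be sketched in a few lines of Python. The Block record and the dictionary-based store are hypothetical simplifications, not the data structures of ForkChoice.dfy: the loop terminates because each step moves to a block with a strictly smaller, non-negative slot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Block:
    """Toy block: a slot number and the root of its parent (None for genesis)."""
    slot: int
    parent: Optional[str]


def ancestors(store: Dict[str, Block], root: str) -> List[str]:
    """Return the roots of all ancestors of `root`, nearest first."""
    chain = []
    current = store[root]
    while current.parent is not None:
        parent = store[current.parent]
        # Ranking function: the slot strictly decreases at every step and is
        # bounded below by 0, so the loop cannot run forever.
        if not parent.slot < current.slot:
            raise ValueError("store invariant violated: parent slot not smaller")
        chain.append(current.parent)
        current = parent
    return chain
```

In this sketch a violated invariant is detected at run time. In Dafny the invariant on the store is proved once, so the check can never fail.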


4.3 Finalisation and Justification 


During the course of the project we benefited from the guidance of the third 
co-author, who has comprehensive expertise in various aspects of the Beacon Chain, 
including the fork choice part, and identified the fork choice implementation of 
the reference implementation as a component that needed verification. 

10 https://github.com/ethereum/consensus-specs/issues/2500 

The fork choice rules are designed to identify a canonical branch in the block- 
tree which in turn defines the canonical chain. To achieve this goal, we first as- 
sumed a fixed set of validators. Then we built a Dafny proof of the GasperFFG [2] 
protocol and tried to prove properties about the justified and finalised blocks in 
the block-tree. We could mechanically prove Lemmas 4.11 and 5.1, and Theorem 5.2 
from [2]. Note that a complete proof in Coq is available in [11] but it does not 
use the Beacon Chain data structures. We only managed to push these properties 
up to a certain level on the functional specifications of our code base and 
not on the actual reference implementation. Doing so would require us to add 
a substantial amount of detail and to modify the structure of several proofs, 
which was not doable in our timeframe. This experimental work is archived in 
branch goal1 of the repository. There is ongoing work focussing on 
this topic: designing the mechanised proofs¹¹ of the refinement soundness of the 
state transition function (Phase 0) w.r.t. the GasperFFG protocol. 


4.4 Reflection 


Verification Effort. The effort for formal verification took 16 person-months. 
This figure is for the Beacon Chain State Transition and does not include the 
time spent on the SSZ and Merkleise libraries that were completed before this 
project started. The division of time was primarily between the second and 
third components of the project. Translation of the reference implementation into 
Dafny took approximately 6 person-months¹². Synthesis of functional specifications 
(manually), including proofs, took approximately 10 person-months. The 
time allocation for the identification of simplifications is more difficult to assess. 
Though some consideration was given initially, this aspect was ongoing, as our 
understanding of the reference implementation evolved. 


Trust Base. The validity of the verification results assumes the correctness 
of the Dafny specification and the Z3 verifier. Dafny is actively maintained and 
under continuous improvement. In the rare instances where Dafny behaves 
unpredictably, bug reports are responded to in a timely manner. During the 
course of this project a few bugs were reported. For example, it was found that 
the definition of an inconsistent const could lead to unsound verification results; 
this was reported as an issue¹³ (fixed) to the Dafny language GitHub repository. 


Practicality of the Approach. The use of Dafny does not require any spe- 
cific knowledge beyond standard program verification (Hoare style proofs) and 
first-order logics. There is ample support (videos, tutorials, books) to help in 
learning how to write Dafny programs and proofs. The main difficulties/challenges 
in writing and verifying projects of this size with Dafny (and the same holds for 
other verification-friendly automated deductive verifiers) are: 1. when the 
verification fails, it requires some experience to interpret the verifier feedback and 
make some progress, and 2. the unpredictability (time-wise) of the reasoning 
engine; this is due to the fact that verification conditions that are generated by 
Dafny are in semi-decidable theories of the underlying SMT-solver (Z3). In our 
experience, adding a seemingly innocuous line of proof may result in either a 
surge or a drastic reduction of verification time. 

11 https://github.com/runtimeverification/beacon-chain-verification 
12 This translation includes the proof of absence of runtime errors. 
13 https://github.com/dafny-lang/dafny/issues/922 


5 Conclusion 


Overall this project was a significant undertaking. The complexity of the state 
transition mechanism, combined with the ambitious project scope, makes this 
one of the largest formal verification projects to be completed using Dafny. Even 
with the model simplifications, the Python language is not particularly compat- 
ible with the fundamentals that underpin formal verification, which presented 
continual challenges. Upon reflection: i) the project would have benefited from a 
larger team and ii) consideration of the application of formal verification meth- 
ods earlier, ideally within the design process, would have had a positive impact. 

The interest generated from this project provided an opportunity to facili- 
tate Dafny training for the reference implementation writers at the Ethereum 
Foundation. This training included the translation of code into Dafny, as well 
as the more advanced topic of proof construction. Participants were able to gain 
insight into the formal verification process which could provide valuable context 
when drafting future reference implementations and specifications. 


Abstract. The Move Prover (MVP) is a formal verifier for smart contracts 
written in the Move programming language. MVP has an expressive specifi- 
cation language, and is fast and reliable enough that it can be run routinely by 
developers and in integration testing. Besides the simplicity of smart contracts 
and the Move language, three implementation approaches are responsible for 
the practicality of MVP: (1) an alias-free memory model, (2) fine-grained in- 
variant checking, and (3) monomorphization. The entirety of the Move code 
for the Diem blockchain has been extensively specified and can be completely 
verified by MVP in a few minutes. Changes in the Diem framework must be 
successfully verified before being integrated into the open source repository 
on GitHub. 


Keywords: Smart contracts - formal verification - Move language - Diem blockchain 


1 Introduction 


The Move Prover (MVP) is a formal verification tool for smart contracts that is 
intended to be used routinely during code development. The verification finishes fast 
and predictably, making the experience of running MVP similar to the experience 
of running compilers, linters, type checkers, and other development tools. Build- 
ing a fast verifier is non-trivial, and in this paper, we would like to share the most 
important engineering and architectural decisions that have made this possible. 
One factor that makes verification easier is applying it to smart contracts. Smart 
contracts are easier to verify than conventional software for at least three reasons: 
1) they are small in code size, 2) they execute in a well-defined, isolated environ- 
ment, and 3) their computations are typically sequential, deterministic, and have 
minimal interactions with the environment (e.g., no explicit I/O operations). At the 
same time, formal verification is more appealing to the advocates for smart contracts 
because of the large financial and regulatory risks that smart contracts may entail if 
they misbehave, as evidenced by large losses that have occurred already [29,19,22]. 
The other crucial factor to the success of MVP is a tight coupling with the Move 
programming language [26]. Move is developed as part of the Diem blockchain [24] 
and is designed to be used with formal verification from day one. Move is currently 
co-evolving with MVP. The language supports specifying pre-, post-, and aborts conditions 
of functions, as well as invariants over data structures and over the content 
of the global persistent memory (i.e., the state of the blockchain). One feature that 
makes verification harder is that quantification can be used freely in specifications. 

Despite this specification richness, MVP is capable of verifying the full Move 
implementation of the Diem blockchain (called the Diem framework [25]) in a few 
minutes. The framework provides functionality for managing accounts and their in- 
teractions, including multiple currencies, account roles, and rules for transactions. It 
consists of about 8,800 lines of Move code and 6,500 lines of specifications (includ- 
ing comments for both), which shows that the framework is extensively specified. 
More importantly, verification is fully automated and runs continuously with unit and 
integration tests, which we consider a testament to the practicality of the approach. 
Running the prover in integration tests requires more than speed: it requires re- 
liability, because tests that work sometimes and fail or time out other times are 
unacceptable in that context. 

MVP is a substantial and evolving piece of software that has been tuned and 
optimized in many ways. As a result, it is not easy to define exactly what imple- 
mentation decisions lead to fast and reliable performance. However, we can at least 
identify three major ideas that resulted in dramatic improvements in speed and re- 
liability since the description of an early prototype of MVP [32]: 


— an alias-free memory model based on Move’s semantics, which are similar to those 
of the Rust programming language; 

— fine-grained invariant checking, which ensures that invariants hold at every state, 
except when a developer explicitly suspends them; and 

— monomorphization, which instantiates type parameters in Move’s generic struc- 
tures, functions, and specification properties. 


The combined effect of all these improvements transformed a tool that worked, 
but often exhibited frustrating, sometimes random [12], timeouts on complex and 
especially on erroneous specifications, to a tool that almost always completes in less 
than 30 seconds. In addition, there have been many other improvements, including a 
more expressive specification language, fewer false positives, and better error reporting. 

The remainder of the paper first introduces the Move language and how MVP 
is used with it, then discusses the design of MVP and the three main optimizations 
above. There is also an appendix that describes injection of function specifications. 


2 Move and the Prover 


Move was developed for the Diem blockchain [24], but its design is not specific to 
blockchains. A Move execution consists of a sequence of updates evolving a global 
persistent memory state, which we just call the (global) memory. Similar to other 
blockchains, updates are a series of atomic transactions. All runtime errors result in 
a transaction abort, which does not change the blockchain state except to transfer 
some currency (“gas”) from the account that sent the transaction to pay for the cost of 
executing the transaction. 

Fig. 1: Account Example Program 


module Account { 
    struct Account has key { 
        balance: u64, 
    } 

    fun withdraw(account: address, amount: u64) acquires Account { 
        let balance = &mut borrow_global_mut<Account>(account).balance; 
        assert(*balance >= amount, Errors::limit_exceeded()); 
        *balance = *balance - amount; 
    } 

    fun deposit(account: address, amount: u64) acquires Account { 
        let balance = &mut borrow_global_mut<Account>(account).balance; 
        assert(*balance <= Limits::max_u64() - amount, Errors::limit_exceeded()); 
        *balance = *balance + amount; 
    } 

    public(script) fun transfer(from: &signer, to: address, amount: u64) 
    acquires Account { 
        assert(Signer::address_of(from) != to, Errors::invalid_argument()); 
        withdraw(Signer::address_of(from), amount); 
        deposit(to, amount); 
    } 
} 


The global memory is organized as a collection of resources, described by Move 
structures (data types). A resource in memory is indexed by a pair of a type and an 
address (for example the address of a user account). For instance, the expression 
exists<Coin<USD>>(addr) will be true if there is a value of type Coin<USD> stored 
at addr. As seen in this example, Move uses type generics, and working with generic 
functions and types is rather idiomatic for Move. 


A Move application consists of a set of transaction scripts. Each script defines 
a Move function with input parameters but no output parameters. This function 
updates the global memory and may emit events. The execution of this function can 
abort because of an abort instruction or implicitly because of a runtime error such 
as an out-of-bounds vector index. 


Programming in Move. In Move, one defines transactions via script functions which 
take a set of parameters. Those functions can call other functions. Script and regular 
functions are encapsulated in modules. Move modules are also the place where 
structs are defined. An illustration of a Move contract is given in Fig. 1 (for a more 
complete description see the Move Book [26]). The example is a simple account 
which holds a balance in the struct Account, and offers the script function transfer 
to manipulate this resource. Scripts generally have signer arguments, which are 
tokens that represent an account address that has been authenticated by a 
cryptographic signature. The assert statement in the example causes a Move transaction 
to abort execution if the condition is not met. Notice that Move, like Rust, 
supports references (as in &signer) and mutable references (as in &mut T). However, 
references cannot be part of structs stored in global memory. 

Fig. 2: Account Example Specification 


module Account { 
    spec transfer { 
        let from_addr = Signer::address_of(from); 
        aborts_if from_addr == to; 
        aborts_if bal(from_addr) < amount; 
        aborts_if bal(to) + amount > Limits::max_u64(); 
        ensures bal(from_addr) == old(bal(from_addr)) - amount; 
        ensures bal(to) == old(bal(to)) + amount; 
    } 

    spec fun bal(acc: address): u64 { 
        global<Account>(acc).balance 
    } 

    invariant forall acc: address where exists<Account>(acc): 
        bal(acc) >= AccountLimits::min_balance(); 

    invariant update forall acc: address where exists<Account>(acc): 
        old(bal(acc)) - bal(acc) <= AccountLimits::max_decrease(); 
} 


Specifying in Move. The specification language supports Design By Contract [18]. 
Developers can provide pre and post conditions for functions, which include conditions 
over parameters and global memory. Developers can also provide invariants 
over data structures, as well as the contents of the global memory. Universal and 
existential quantification over bounded domains, such as the indices of a vector, 
as well as effectively unbounded domains, such as memory addresses and integers, 
are supported. Quantifiers make the verification problem undecidable and cause dif- 
ficulties with timeouts. However, in practice, we notice that quantifiers have the ad- 
vantage of allowing more direct formalization of many properties, which increases 
the clarity of specifications. 

Fig. 2 illustrates the specification language by extending the account example in 
Fig. 1 (for the definition of the specification language see [27]). This adds the spec- 
ification of the transfer function, a helper function bal for use in specs, and two 
global memory invariants. The first invariant states that a balance can never drop 
below a certain minimum. The second invariant refers to an update of global 
memory with pre and post state: the balance on an account can never decrease in 
one step by more than a certain amount. Note that while the Move programming language 
has only unsigned integers, the specification language uses arbitrary precision 
signed integers, making it convenient to specify something like x + y <= limit, 
without the complication of arithmetic overflow. 
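
The reading of the transfer specification in Fig. 2 can be mimicked with a small Python model. This is only an illustrative sketch, not Move semantics: balances are kept in a dictionary, both accounts are assumed to exist, and the Abort exception and the function name are hypothetical. Python integers are unbounded, which plays the same role as the arbitrary precision integers of the specification language.

```python
MAX_U64 = 2**64 - 1


class Abort(Exception):
    """Models a Move transaction abort (hypothetical helper)."""


def transfer(balances: dict, from_addr: str, to: str, amount: int) -> dict:
    """Return the post-state of a transfer, or raise Abort."""
    # aborts_if conditions, evaluated on the pre-state.
    if from_addr == to:
        raise Abort("invalid argument")
    if balances[from_addr] < amount:
        raise Abort("limit exceeded")
    # No overflow can occur in this comparison: the sum is an unbounded integer.
    if balances[to] + amount > MAX_U64:
        raise Abort("limit exceeded")

    post = dict(balances)
    post[from_addr] -= amount
    post[to] += amount

    # ensures conditions, relating the post-state to the old (pre-) state.
    assert post[from_addr] == balances[from_addr] - amount
    assert post[to] == balances[to] + amount
    return post
```

For example, transfer({"a": 10, "b": 5}, "a", "b", 3) returns {"a": 7, "b": 8}, while a transfer that would push the recipient above MAX_U64 raises Abort and leaves the input dictionary unchanged.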

Specifications for the withdraw and deposit functions have been omitted in this 
example. MVP supports omitting specs for non-recursive functions, in which case 
they are treated as being inlined at the call site. 


Running the Prover. MVP is fully automatic, like a type checker or linter, and 
is expected to finish in a reasonable time, so it can be integrated in the regular 
development workflow. Running MVP on the module Account produces multiple 
errors. The first is this one: 

Fig. 3: Move Prover Architecture 

MVP detected that an implicit abort condition is missing in the specification of the 
withdraw function. It prints the context of the error, as well as an execution trace 
which leads to the error. Values of variable assignments from the counterexample 
found by the SMT solver are printed together with the execution trace. Logically, 
the counterexample presents an assignment to variables where the program fails to 
meet the specification. In general, MVP attempts to produce readable diagnostics 
for Move developers without requiring them to understand any internals of the prover. 

There are more verification errors in this example, related to the global in- 
variants: the code makes no attempt to respect the limits in min_balance() and 
max_decrease(). The problem can be fixed by adding more assert statements to 
check that the limits are met (see full version of the paper [7]). 

The programs and specifications MVP deals with are much larger than this ex- 
ample. The conditions under which a transaction in the Diem framework can abort 
typically involve dozens of individual predicates, stemming from other functions 
called by this transaction. Moreover, there are hundreds of memory invariants spec- 
ified, encoding access control and other requirements for the Diem blockchain. 


3 Move Prover Design 


The architecture of MVP is illustrated in Fig. 3. Move code (containing specifications) 
is given as input to the tool chain, which produces two artifacts: an abstract 
syntax tree (AST) of the specifications, and the generated bytecode. The Move Model 
merges both bytecode and specifications, as well as other metadata from the original 
code, into a unified object model which is input to the remaining tool chain. 

The next phase is the actual Prover Compiler, which is a pipeline of bytecode 
transformations. We focus on the transformations shown (Reference Elimination, 
Specification Injection, and Monomorphization). The Prover uses a modified version 
of the Move VM bytecode as an intermediate representation for these transforma- 
tions, but, for clarity, we describe the transformations at the Move source level. 

The transformed bytecode is next compiled into the Boogie intermediate verifi- 
cation language [3]. Boogie supports an imperative programming model which is 
well suited for the encoding of the transformed Move code. Boogie in turn can trans- 
late to multiple SMT solver backends, namely Z3 [20] and CVC5 [23]; the default 
choice for the Move prover is currently Z3. 


3.1 Reference Elimination 


The reference elimination transformation is what enables the alias-free memory 
model in the Move Prover, which is one of the most important factors contributing 
to the speed and reliability of the system. In most software verification and static 
analysis systems, the explosion in number of possible aliasing relationships between 
references leads either to high computational complexity or harsh approximations. 

In Move, the reference system is based on borrow semantics [5] as in the Rust 
programming language. The initial borrow must come from either a global memory 
or a local variable on stack (both referred to as locations from now on). For local 
variables, one can create immutable references (with syntax &x) and mutable refer- 
ences (with syntax &mut x). For global memories, the references can be created via 
the borrow_global and borrow_global_mut built-ins. Given a reference to a whole 
struct, field borrowing can occur via &mut x.f and &x.f. Similarly, with a reference 
to a vector, element borrowing occurs via the native functions Vector::borrow(v, i)
and Vector::borrow_mut(v, i). Move provides the following guarantees, which
are enforced by the borrow checker: 


— For any location, there can be either exactly one mutable reference, or n im- 
mutable references. Enforcing this rule is similar to enforcing the borrow seman- 
tics in Rust, except for global memories, which do not exist in Rust. For global 
memories, this rule is enforced via the acquires annotations. Using Fig. 1 as an 
example, function withdraw acquires the Account global location; therefore,
withdraw is prohibited from calling any other function that might also borrow 
or modify the Account global memory (e.g., deposit). 

— The lifetime of references to data on the stack cannot exceed the lifetime of 
the stack location. This includes global memories borrowed inside a function 
as well—a reference to a global memory cannot be returned from the function, 
neither in whole nor in parts. 


These properties effectively permit the elimination of references from a Move program,
eliminating the need to reason about aliasing.

Immutable References Immutable references are replaced by values. An example
of the applied transformation is shown below. We remove the reference type con- 
structor and all reference-taking operations from the code: 


fun select_f(s: &S): &T { &s.f }  ⇝  fun select_f(s: S): T { s.f }


When executing a Move program, immutable references are important to avoid 
copies for performance and to enforce ownership; however, for symbolic reason- 
ing on correct Move programs, the distinction between immutable references and 
values is unimportant. 


Mutable References Each mutation of a location l starts with an initial borrow for
the whole data stored in this location. This borrow creates a reference r. As long as
r is alive, Move code can either update its value (*r = v), or derive a sub-reference
(r' = &mut r.f). The mutation ends when r (and the derived r') go out of scope.

The borrow checker guarantees that during the mutation of the data in l, no
other reference can exist into the same data in l, meaning that it is impossible for
other Move code to test whether the value has mutated while the reference is held.

These semantics allow mutable references to be handled via read-update-write
cycles. One can create a copy of the data in l and perform a sequence of mutation
steps which are represented as purely functional data updates. Once the last reference
for the data in l goes out of scope, the updated value is written back to l. This
converts an imperative program with references into an imperative program which 
only has state updates on global memory or variables on the stack, with no aliasing. 
We illustrate the basics of this approach by an example: 

fun increment(x: &mut u64) { *x = *x + 1 }
fun increment_field(s: &mut S) { increment(&mut s.f) }
fun caller(): S { let s = S{f:0}; increment_field(&mut s); s }
⇝
fun increment(x: u64): u64 { x + 1 }
fun increment_field(s: S): S { s[f = increment(s.f)] }
fun caller(): S { let s = S{f:0}; s = increment_field(s); s }
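The read-update-write idea can also be mimicked outside of Move. The following is a minimal Python sketch, not MVP code: the frozen dataclass S stands in for an immutable Move struct value, and dataclasses.replace plays the role of the functional field update s[f = ...]. It assumes that caller is meant to invoke increment_field.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class S:
    """Stand-in for a Move struct with a single u64 field f."""
    f: int


def increment(x: int) -> int:
    # Was: fun increment(x: &mut u64) { *x = *x + 1 }
    # The mutable reference becomes a value parameter plus a returned value.
    return x + 1


def increment_field(s: S) -> S:
    # Read s.f, update it functionally, and build the updated struct value.
    return replace(s, f=increment(s.f))


def caller() -> S:
    s = S(f=0)
    # Write-back: the local is overwritten with the updated value.
    s = increment_field(s)
    return s


print(caller())
```

No aliasing can arise in this form, because every update produces a new value that is explicitly assigned back to exactly one location.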


Dynamic Mutable References While the setup in the above example covers a majority
of the use cases in everyday Move code, the general case is more complex, since the
referenced location may not be known statically. Consider the following Move code: 


let r = if (p) &mut s1 else &mut s2; 
increment_field(r); 


Additional information in the logical encoding is required to deal with such cases. 
When a reference goes out of scope, we need to know from which location it was 
derived in order to write back the updated value. Fig. 4 illustrates the approach for 
doing this. Essentially, a new type Mut<T>, which is internal to MVP, is introduced 
to track both the location from which T was derived and the value of T. Mut<T> 
supports the following operations: 


— Mvp::mklocal(value, LOCAL_ID) creates a new mutation value for a local with
the given local id. A local id uniquely identifies a local variable in the function.
Similarly, Mvp::mkglobal(value, TYPE_ID, addr) creates a new mutation for
a global with the given type and address.

— With r' = Mvp::field(r, FIELD_ID) a mutation value for a sub-reference is
created for the identified field.

— The value of a mutation is replaced with r' = Mvp::set(r, v) and retrieved
with v = Mvp::get(r).

— With the predicate Mvp::is_local(r, LOCAL_ID) one can test whether r was
derived from the given local, and with Mvp::is_global(r, TYPE_ID, addr)
whether it was derived from a specific global location. Mvp::is_field(r, FIELD_ID)
tests whether r is derived from the given field.


Fig. 4: Elimination of Mutable References


fun increment(x: &mut u64) { *x = *x + 1 }
fun increment_field(s: &mut S) {
  let r = if (s.f > 0) &mut s.f else &mut s.g;
  increment(r)
}
fun caller(p: bool): (S, S) {
  let s1 = S{f:0, g:0}; let s2 = S{f:1, g:1};
  let r = if (p) &mut s1 else &mut s2;
  increment_field(r);
  (s1, s2)
}
⇝
fun increment(x: Mut<u64>): Mut<u64> { Mvp::set(x, Mvp::get(x) + 1) }
fun increment_field(s: Mut<S>): Mut<S> {
  let r = if (s.f > 0) Mvp::field(s.f, S_F) else Mvp::field(s.g, S_G);
  r = increment(r);
  if (Mvp::is_field(r, S_F))
    s = Mvp::set(s, Mvp::get(s)[f = Mvp::get(r)]);
  if (Mvp::is_field(r, S_G))
    s = Mvp::set(s, Mvp::get(s)[g = Mvp::get(r)]);
  s
}
fun caller(p: bool): (S, S) {
  let s1 = S{f:0, g:0}; let s2 = S{f:1, g:1};
  let r = if (p) Mvp::mklocal(s1, CALLER_s1)
          else Mvp::mklocal(s2, CALLER_s2);
  r = increment_field(r);
  if (Mvp::is_local(r, CALLER_s1))
    s1 = Mvp::get(r);
  if (Mvp::is_local(r, CALLER_s2))
    s2 = Mvp::get(r);
  (s1, s2)
}
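To make the role of the Mut<T> operations concrete, the transformed program of Fig. 4 can be replayed in Python. This is only a sketch under assumptions: the Mut class, the tuple encoding of origins, and the use of field names as field ids are hypothetical and not taken from the MVP implementation; set_ is named with a trailing underscore only to avoid shadowing Python's built-in set.

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple


@dataclass(frozen=True)
class Mut:
    """A value together with the origin it was borrowed from (hypothetical encoding)."""
    origin: Tuple[str, Any]  # ("local", id), ("global", (type_id, addr)) or ("field", id)
    value: Any


def mklocal(value, local_id):
    return Mut(("local", local_id), value)


def mkglobal(value, type_id, addr):
    return Mut(("global", (type_id, addr)), value)


def field(parent: Mut, field_id: str) -> Mut:
    # Sub-reference: remember which field of the parent value was borrowed.
    return Mut(("field", field_id), getattr(parent.value, field_id))


def get(r: Mut):
    return r.value


def set_(r: Mut, v) -> Mut:
    return replace(r, value=v)


def is_local(r: Mut, local_id) -> bool:
    return r.origin == ("local", local_id)


def is_global(r: Mut, type_id, addr) -> bool:
    return r.origin == ("global", (type_id, addr))


def is_field(r: Mut, field_id: str) -> bool:
    return r.origin == ("field", field_id)


@dataclass(frozen=True)
class S:
    f: int
    g: int


def increment(x: Mut) -> Mut:
    return set_(x, get(x) + 1)


def increment_field(s: Mut) -> Mut:
    r = field(s, "f") if get(s).f > 0 else field(s, "g")
    r = increment(r)
    # Write back to whichever field r was derived from.
    if is_field(r, "f"):
        s = set_(s, replace(get(s), f=get(r)))
    if is_field(r, "g"):
        s = set_(s, replace(get(s), g=get(r)))
    return s


def caller(p: bool):
    s1, s2 = S(0, 0), S(1, 1)
    r = mklocal(s1, "CALLER_s1") if p else mklocal(s2, "CALLER_s2")
    r = increment_field(r)
    # Write back to whichever local r was derived from.
    if is_local(r, "CALLER_s1"):
        s1 = get(r)
    if is_local(r, "CALLER_s2"):
        s2 = get(r)
    return s1, s2


print(caller(True), caller(False))
```

The case distinctions on is_field and is_local are the write-back steps: exactly one branch fires per release, so the updated value reaches the location the reference was dynamically derived from.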


MVP implements the illustrated transformation by constructing a borrow graph
from the program via data flow analysis. This graph tracks both when references
are released as well as how they relate to each other: e.g. r' = &mut r.f creates an
edge from r to r' labeled with f, and r'' = &mut r.g creates another also starting
from r. The borrow analysis is inter-procedural, requiring computed summaries for 
the borrow graph of called functions. 

The resulting borrow graph is then used to guide the transformation, inserting
the operations of the Mut<T> type as illustrated in Fig. 4. Specifically, when the borrow
on a reference ends, the associated mutation value must be written back to its
parent mutation or the original location (e.g. line 29 in Fig. 4). The presence of multiple
possible origins leads to case distinctions via Mvp::is_X predicates; however,
these cases are rare in actual Move programs.
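A borrow graph of this kind can be pictured as a labeled directed graph. The sketch below is a hypothetical, intra-procedural Python illustration (the class and method names are assumptions, and MVP's actual data structure and its inter-procedural summaries are not modeled): each borrow adds a labeled edge, and a reference needs a case distinction on release exactly when it has more than one possible origin.

```python
from collections import defaultdict


class BorrowGraph:
    """Hypothetical borrow graph: nodes are locations or references, edges are borrows."""

    def __init__(self):
        # child reference -> list of (parent, label) pairs it may have been derived from
        self.parents = defaultdict(list)

    def add_borrow(self, parent, child, label):
        # e.g. r' = &mut r.f  is  add_borrow("r", "r'", "f")
        self.parents[child].append((parent, label))

    def write_back_targets(self, ref):
        # Where the value of `ref` may have to be written back when its borrow ends.
        return list(self.parents[ref])

    def needs_case_distinction(self, ref):
        # More than one possible origin requires Mvp::is_X style tests.
        return len(self.parents[ref]) > 1


# Two sub-references of r, each with a single static origin.
g = BorrowGraph()
g.add_borrow("r", "r'", "f")
g.add_borrow("r", "r''", "g")

# One reference r that is derived from either s.f or s.g, as in increment_field of Fig. 4.
h = BorrowGraph()
h.add_borrow("s", "r", "f")
h.add_borrow("s", "r", "g")

print(g.write_back_targets("r'"), h.needs_case_distinction("r"))
```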


3.2 Global Invariant Injection 


Correctness of smart contracts is largely about the correctness of the blockchain
state, so global invariants are particularly important in the Move specification language.
For example, in the Diem framework, global invariants can capture the requirement
that an account be accompanied by various other types that are stored
at the same address, and the requirement that certain state changes are only permitted
for certain accounts by the access control scheme.

Most software verification tools prove that functions preserve invariants by assuming
the invariant at the entry to each function and proving it at the exit. In a
module or class, it is only necessary to prove that invariants are preserved by public
functions, since invariants are often violated internally in the implementation of a 
module or class. An earlier version of the Move Prover used exactly this approach. 

The current implementation of the Prover takes the opposite approach: it ensures 
that invariants hold after every instruction, unless explicitly directed to suspend 
some invariants by a user. This fine-grained approach has performance advantages, 
because, unless suspended, invariants are only proven when an instruction is executed 
that could invalidate them, and the proofs are often computationally simple because 
the change from a single instruction is usually small. Relatively few invariants are 
suspended, and, when they are, it is over a relatively small span of instructions, 
preserving these advantages. There is another important advantage, which is that 
invariants hold almost everywhere in the code, so they are available to prove
other properties, such as abort conditions. For example, if a function accesses type 
T1 and then type T2, the access to T2 will never abort if the presence of T1 implies 
the presence of T2 at every state in the body of the function. This situation occurs 
with some frequency in the Diem framework. 


Invariant Types and Proof Methodology Inductive invariants are properties declared
in Move modules that must (by default) hold for the global memory at all
times. Those invariants often quantify over addresses (see Fig. 2 for an example). Based
on Move's borrow semantics, inductive invariants don't need to hold while memory
is mutated because the changes are not visible to other code until the change is
written back. This is reflected by the reference elimination described in Sec. 3.1.

Update invariants are properties that relate two states, a previous state and the
current state. Typically they are enforced after an update of global memory. The old
operator is used to evaluate specification expressions in the previous state.

Verification of both kinds of invariants can be suspended. That means, instead of
being verified at the time a memory update happens, they are verified at the call site 
of the function which updates memory. This feature is necessitated by fine-grained 
invariant checking, because invariants sometimes do not hold in the midst of internal 
computations of a module. For example, a relationship between state variables may
not hold when the variables are being updated sequentially. Functions with external
callers (public or script functions) cannot suspend invariant verification, since the
invariants are assumed to hold at the beginning and end of each such function.


Fig. 5: Basic Global Invariant Injection


fun f(a: address) {
  let r = borrow_global_mut<S>(a);
  r.value = r.value + 1
}
invariant [I1] forall a: address: global<S>(a).value > 0;
invariant [I2] update forall a: address:
    global<S>(a).value > old(global<S>(a).value);
⇝
fun f(a: address) {
  spec assume I1;
  Mvp::snapshot_state(I2_BEFORE);
  r = <increment mutation>;
  spec assert I1;
  spec assert I2[old = I2_BEFORE];
}

Inductive invariants are proven by induction over the evolution of the global 
memory. The base case is that the invariant must hold in the empty state that pre- 
cedes the genesis transaction. For the induction step, we can assume that the invari- 
ant holds at each verified function entry point for which it is not suspended, and 
now must prove that it holds after program points which are either direct updates 
of global memory, or calls to functions which suspend invariants. 

For update invariants, no induction proof is needed, since they just relate two 
memories. The pre-state is some memory captured before an update happens, and
the post-state is the current state.


Modular Verification We wish to support open systems to which untrusted modules 
can be added with no chance of violating invariants that have already been proven. 
For each invariant, there is a defined subset of Move modules (called a cluster). If the 
invariant is proven for the modules in the cluster, it is guaranteed to hold in all other 
modules — even those that were not yet defined when the invariant was proven. The 
cluster must contain every function that can invalidate the invariant, and, in case 
of invariant suspension, all callers of such a function. Importantly, functions outside 
the cluster can never invalidate an invariant. Those functions trivially preserve the 
invariant, so it is only necessary to verify functions defined in the cluster. 

MVP verifies a given set of modules at a time (typically one). The modules being 
verified are called the target modules, and the global invariants to be verified are 
called target invariants, which are all invariants defined in the target modules. The 
cluster is then the smallest set as specified above such that all target modules are 
contained. 


Basic Translation We first look at injection of global invariants in the absence of 
type parameters. Fig. 5 contains an example for the supported invariant types and 
their injection into code. The first invariant, I1, is an inductive invariant. It is assumed
on function entry, and asserted after the state update. The second, I2, is an
update invariant, which relates pre- and post-states. For this, a state snapshot is stored
under some label I2_BEFORE, which is then used in an assertion.


Fig. 6: Global Invariant Injection and Genericity


invariant [I1] global<S<u64>>(0).value > 1;
invariant<T> [I2] global<S<T>>(0).value > 0;
fun f(a: address) { borrow_global_mut<S<u8>>(0).value = 2 }
fun g<R>(a: address) { borrow_global_mut<S<R>>(0).value = 3 }
⇝
fun f(a: address) {
  spec assume I2[T = u8];
  <<mutate>>
  spec assert I2[T = u8];
}
fun g<R>(a: address) {
  spec assume I1; spec assume I2[T = R];
  <<mutate>>
  spec assert I1; spec assert I2[T = R];
}

Global invariant injection is optimized by knowledge of the prover, obtained
by static analysis, about accessed and modified memory. Let accessed(f) be the
memory accessed by a function f, and modified(f) be the memory modified by it. Let
accessed(I) be the memory accessed by an invariant I (including transitively by all
functions it calls).


— Inject assume I at entry to f if accessed(f) has overlap with accessed(I).

— Inject assert I after each program step if one of the following is true: (a) the
step modifies a memory location M in accessed(I), or (b) the step is a call to a
function f' in which I is suspended and modified(f') intersects with accessed(I).
Also, if I is an update invariant, inject a save of a memory snapshot before
the update or call.
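The two injection rules can be sketched as a small decision procedure over sets of memory names. The Python below is illustrative only: the Invariant and Step records, the string-based output, and the choice to assume only inductive invariants at function entry (mirroring Fig. 5) are assumptions of this sketch, not a description of MVP's implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Invariant:
    name: str
    accessed: FrozenSet[str]   # accessed(I): memory read by the invariant
    is_update: bool = False    # True for update invariants


@dataclass(frozen=True)
class Step:
    modifies: FrozenSet[str] = frozenset()         # memory written directly by this step
    callee_suspends: FrozenSet[str] = frozenset()  # invariants suspended in a called f'
    callee_modified: FrozenSet[str] = frozenset()  # modified(f') for that call


def inject(fun_accessed: FrozenSet[str], steps: List[Step], inv: Invariant) -> List[str]:
    out = []
    # Assumption: only inductive invariants are assumed at entry, as in Fig. 5.
    if not inv.is_update and fun_accessed & inv.accessed:
        out.append(f"assume {inv.name}")
    for i, step in enumerate(steps):
        direct = bool(step.modifies & inv.accessed)                    # rule (a)
        via_call = (inv.name in step.callee_suspends
                    and bool(step.callee_modified & inv.accessed))     # rule (b)
        if (direct or via_call) and inv.is_update:
            out.append(f"snapshot {inv.name}_BEFORE")
        out.append(f"step {i}")
        if direct or via_call:
            out.append(f"assert {inv.name}")
    return out


MEM_S = frozenset({"S"})
I1 = Invariant("I1", accessed=MEM_S)
I2 = Invariant("I2", accessed=MEM_S, is_update=True)
I3 = Invariant("I3", accessed=frozenset({"R"}))   # unrelated memory: nothing is injected
body = [Step(modifies=MEM_S)]

for inv in (I1, I2, I3):
    print(inv.name, inject(MEM_S, body, inv))
```

An invariant over memory the function never touches contributes neither assumptions nor assertions, which is the source of the savings described above.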


Genericity Generic type parameters make the problem of determining whether a 
function can modify an invariant more difficult. Consider the example in Fig. 6. 
Invariant I1 holds for a specific type instantiation S<u64>, whereas I2 is generic
over all type instantiations for S<T>. 

The non-generic function f which works on the instantiation S<u8> will have to 
inject the specialized instance I2[T = u8]. The invariant I1, however, does not apply
for this function, because there is no overlap with S<u64>. In contrast, g is generic
in type R, which could be instantiated to u64. So, I1, which applies to S<u64>, needs
to be injected in addition to I2.

The general solution depends on type unification. Given the accessed memory 
of a function f<R> and an invariant I<T>, we compute the pairwise unification of 
memory types. Those types are parameterized over R resp. T. Successful unification 
results in a substitution for both type parameters, and we include the invariant with 
T specialized according to the substitution. 
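The unification step can be illustrated with a textbook first-order unifier. The Python sketch below uses hypothetical Var and Con classes for type parameters and type constructors, omits the occurs check for brevity, and replays the four function/invariant pairs of Fig. 6; it is not the prover's own code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Var:
    """A type parameter such as T or R."""
    name: str


@dataclass(frozen=True)
class Con:
    """A type constructor application such as u64 or S<T>."""
    name: str
    args: Tuple["Type", ...] = ()


Type = Union[Var, Con]
Subst = Dict[str, Type]


def resolve(t: Type, s: Subst) -> Type:
    # Follow bindings of already-substituted type parameters.
    while isinstance(t, Var) and t.name in s:
        t = s[t.name]
    return t


def unify(a: Type, b: Type, s: Optional[Subst] = None) -> Optional[Subst]:
    """Return a substitution making a and b equal, or None (no occurs check)."""
    s = dict(s or {})
    a, b = resolve(a, s), resolve(b, s)
    if a == b:
        return s
    if isinstance(a, Var):
        s[a.name] = b
        return s
    if isinstance(b, Var):
        s[b.name] = a
        return s
    if a.name != b.name or len(a.args) != len(b.args):
        return None
    for x, y in zip(a.args, b.args):
        s = unify(x, y, s)
        if s is None:
            return None
    return s


u8, u64 = Con("u8"), Con("u64")


def S(t: Type) -> Con:
    return Con("S", (t,))


# Invariant memory is unified against function memory (invariant first).
print(unify(S(u64), S(u8)))             # I1 vs f: no overlap
print(unify(S(Var("T")), S(u8)))        # I2 vs f
print(unify(S(u64), S(Var("R"))))       # I1 vs g<R>
print(unify(S(Var("T")), S(Var("R"))))  # I2 vs g<R>
```

A result of None means the invariant is not injected; otherwise the substitution gives the specialization, such as I2[T = u8] for f and I2[T = R] for g.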


3.3 Monomorphization 


Monomorphization is a transformation which removes generic types from a Move
program by specializing the program for relevant type instantiations. In the context
of verification, the goal is that the specialized program verifies if and only if the
generic program verifies in an encoding which supports types as first class values.
We expect the specialized program to verify faster because it avoids the problem
of generic representation of values, supporting a multi-sorted representation in the
SMT logic.


Fig. 7: Basic Monomorphization


struct S<T> { .. }
fun f<T>(x: T) { g<S<T>>(S(x)) }
fun g<S:key>(s: S) { move_to<S>(.., s) }
⇝
struct T{}
struct S_T{ .. }
fun f_T(x: T) { g_S_T(S_T(x)) }
fun g_S_T(s: S_T) { move_to<S_T>(.., s) }

To verify a generic function for all possible instantiations, monomorphization 
skolemizes the type parameter, i.e. the function is verified for a new type with no 
special properties that represents an arbitrary type. It then specializes all called func- 
tions and used data types with this new type and any other concrete types they may 
use. Fig. 7 sketches this approach. 

However, this approach has one issue: the type of genericity Move provides does 
not allow for full type erasure (unlike many programming languages) because types 
are used to index global memory (e.g. global<S<T>>(addr) where T is a generic 
type). Consider the following Move function: 


fun f<T>(..) { move_to<S<T>>(s, ..); move_to<S<u64>>(s, ..) } 


Depending on how T is instantiated, this function behaves differently. Specifically, 
if T is instantiated with u64 the function will always abort at the second move_to, 
since the target location is already occupied. 
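The type-dependent behavior is easy to reproduce with a toy model of type-indexed global memory. In the Python sketch below, GlobalMemory, Abort, and the tuple encoding of S<T> are hypothetical stand-ins, and the values stored are placeholders for the elided arguments of the Move function.

```python
class Abort(Exception):
    """Models a Move abort."""


class GlobalMemory:
    """Toy global memory indexed by (type instantiation, address)."""

    def __init__(self):
        self.cells = {}

    def move_to(self, ty, addr, value):
        # Publishing a resource aborts if one of the same type already exists there.
        if (ty, addr) in self.cells:
            raise Abort(f"{ty} already published at {addr}")
        self.cells[(ty, addr)] = value


def f(mem, T, addr):
    # fun f<T>(..) { move_to<S<T>>(s, ..); move_to<S<u64>>(s, ..) }
    mem.move_to(("S", T), addr, "first")
    mem.move_to(("S", "u64"), addr, "second")


def aborts(T):
    try:
        f(GlobalMemory(), T, addr=1)
        return False
    except Abort:
        return True


print(aborts("u8"), aborts("u64"))
```

Because the key of a memory cell includes the type instantiation, erasing T would conflate the two calls to move_to, which is why full type erasure is not an option here.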

The important property enabling monomorphization in the presence of such type 
dependent code is that one can identify the situation by looking at the memory ac- 
cessed by code and injected specifications. From this one can derive additional in- 
stantiations of the function which need to be verified. In the example above, verifying 
both f_T and an instantiation f_u64 will cover all relevant cases of the function be- 
havior. 

The algorithm for computing the instances that require verification works as 
follows. Let f<T1,..,Tn> be a verified target function which has all specifications 
injected and inlined function calls expanded. 


— For each memory M in modified(f), if there is a memory M' in
modified(f) + accessed(f) such that M and M' can unify via T1,..,Tn, collect an
instantiation of the type parameters Ti from the resulting substitution. This instantiation
may not assign values to all type parameters, and those unassigned parameters
stay as is. For instance, f<T1, T2> might have a partial instantiation f<T1, u8>.

— Once all partial instantiations are computed, the set is extended by unifying the
instantiations against each other. If <T> and <T'> are in the set, and they unify
under the substitution s, then <s(T)> will also be part of the set. For example,
consider f<T1, T2> which modifies M<T1> and R<T2>, as well as accesses M<u64>
and R<u8>. From this the instantiations <u64, T2> and <T1, u8> are computed,
and the additional instantiation <u64, u8> will be added to the set.

— If after computing and extending instantiations any type parameters remain,
they are skolemized into a given type as described earlier.
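The three steps above can be sketched in Python as follows. The encoding is an assumption of this sketch: a type is either a string (a type parameter if listed in params, otherwise a primitive such as u64) or a tuple (constructor, arg, ...). The sketch also seeds the result with the fully generic instantiation, which is inferred from the f_T / f_u64 example rather than stated in the algorithm, and the skolem naming is purely illustrative.

```python
from itertools import combinations


def resolve(t, s):
    # Follow substitution chains for type parameters.
    while isinstance(t, str) and t in s:
        t = s[t]
    return t


def apply(t, s):
    t = resolve(t, s)
    if isinstance(t, tuple):
        return (t[0],) + tuple(apply(x, s) for x in t[1:])
    return t


def occurs(p, t, s):
    t = resolve(t, s)
    if isinstance(t, tuple):
        return any(occurs(p, x, s) for x in t[1:])
    return t == p


def unify(a, b, params, s=None):
    """Unify two types over the given type parameters; None if impossible."""
    s = dict(s or {})
    a, b = resolve(a, s), resolve(b, s)
    if a == b:
        return s
    for x, y in ((a, b), (b, a)):
        if isinstance(x, str) and x in params:
            if occurs(x, y, s):
                return None
            s[x] = y
            return s
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, params, s)
            if s is None:
                return None
        return s
    return None


def instantiations(params, modified, accessed):
    insts = {tuple(params)}  # assumption: the generic instance is always verified
    # Step 1: partial instantiations from overlapping memory.
    for m in modified:
        for other in list(modified) + list(accessed):
            s = unify(m, other, params)
            if s:  # skip failures and trivial (empty) substitutions
                insts.add(tuple(apply(p, s) for p in params))
    # Step 2: extend the set by unifying instantiations against each other.
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(insts), 2):
            s = unify(("inst",) + a, ("inst",) + b, params)
            if s is not None:
                new = tuple(apply(x, s) for x in a)
                if new not in insts:
                    insts.add(new)
                    changed = True
    return insts


def skolemize(inst, params):
    # Step 3: remaining parameters become fresh opaque types.
    fresh = {p: f"{p}_skolem" for p in params}
    return tuple(apply(t, fresh) for t in inst)


params = ("T1", "T2")
result = instantiations(params, [("M", "T1"), ("R", "T2")], [("M", "u64"), ("R", "u8")])
print(sorted(result))
```

For the example from the text, the result contains <u64, T2>, <T1, u8>, and the combined <u64, u8>, in addition to the generic instance assumed by this sketch.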


To understand the correctness of this procedure, consider the following arguments 
(a full formal proof is outstanding): 


— Direct interaction: Whenever a modified memory M<t> can influence the interpretation
of M<t'>, a unifier must exist for the types t and t', and an instantiation
will be verified which covers the overlap of t and t'.

— Indirect interaction: If there is an overlap between two types which influences
whether another overlap is semantically relevant, the combination of both overlaps
will be verified via the extension step.


Notice that even though it is not common in regular Move code to work with 
both memory S<T> and, say, S<u64> in one function, there is a scenario where such 
code is implicitly created by injection of global invariants. Consider the example in 
Fig. 6. The invariant I1 which works on S<u64> is injected into the function g<R>
which works on S<R>. When monomorphizing g, we need to verify an instance g_u64 
in order to ensure that I1 holds. 


4 Analysis 


Reliability and Performance The three improvements described above resulted in 
a major qualitative change in performance and reliability. In the version of MVP 
released in September 2020, correct examples verified fairly quickly and reliably. 
However, that was only because, needing speed and reliability, we had disabled some properties
that always timed out and others that timed out unpredictably when there were
small changes in the framework. We learned that incorrect programs or specifica- 
tions would time out predictably enough that it was a good bet that examples that 
timed out were erroneous. However, localizing the error to fix it was very hard, be- 
cause debugging is based on a counterexample that violates the property, and getting 
a counterexample requires termination! 

With each of the transformations described, we witnessed significant speedups 
and, more importantly, reductions in timeouts. Monomorphization was the last fea- 
ture implemented, and, with it, timeouts almost disappeared. Although this was the 
most important improvement in practice, it is difficult to quantify because there have 
been many changes in the Diem framework, its specifications, MVP, and even the Move
language over that time. 

It is simpler (but less important) to quantify the changes in run time of MVP 
on one of our more challenging modules, the DiemAccount module, which is the 
biggest module in the Diem framework. This module implements basic function- 
ality to create and maintain multiple types of accounts on the blockchain, as well 
as manage their coin balances. It was called LibraAccount in release 1.0 of MVP, 
and is called DiemAccount today. The comparison requires various patches as described
in [17]. The table below lists the consolidated numbers of lines, functions,
invariants, conditions (requires, ensures, and aborts-if), as well as the verification
times:


Module        Lines  Functions  Invariants  Conditions  Timing
LibraAccount   1975         72          10         113  9.899s
DiemAccount    2554         64          32         171  7.340s


Notice that DiemAccount has significantly grown in size compared to the older ver- 
sion. Specifically, additional specifications have been added. Moreover, in the origi- 
nal LibraAccount, some of the most complex functions had to be disabled for ver- 
ification because the old version of MVP would time out on them. In contrast, in 
DiemAccount and with the new version, all functions are verified. Verification time 
has been improved by roughly 25%, in the presence of three times more global invariants
and 50% more function conditions.

We were able to observe similar improvements for the remainder of the 40 modules
of the Diem framework. All of the roughly half-dozen timeouts were resolved after
the introduction of the transformations described in this paper.


Causes for the Improvements It’s difficult to pin down and measure exactly why 
the three transformations described improved performance and reliability so dra- 
matically. We have explained some reasons in the subsections above: the alias-free 
memory model reduced search through combinatorial sharing arrangements, and the
fine-grained invariant checking results in simpler formulas for the SMT solver. 

We found that most timeouts in specifications stemmed from our liberal use of
quantifiers. To disprove a property $P_0$ after assuming a list of properties $P_1, \dots, P_n$,
the SMT solver must show that $\neg P_0 \land P_1 \land \dots \land P_n$ is satisfiable. The search
usually involves instantiating universal quantifiers in $P_1, \dots, P_n$. The SMT solver can do
this endlessly, resulting in a timeout. Indeed, we often found that proving a postcondition
false would time out, because the SMT solver was instantiating quantifiers
to find a satisfying assignment of $P_1 \land \dots \land P_n$. Simpler formulas result in fewer
intermediate terms during solving, resulting in fewer opportunities to instantiate
quantified formulas.

We believe that one of the biggest impacts, specifically on removing timeouts and 
improving predictability, is monomorphization. The reason for this is that monomor- 
phization allows a multi-sorted representation of values in Boogie (and eventually 
the SMT solver). In contrast, before monomorphization, we used a universal domain 
for values in order to represent values in generic functions, roughly as follows: 


type Value = Num(int) | Address(int) | Struct(Vector<Value>) | 


This creates a large overhead for the SMT solver, as we need to exhaustively inject
type assumptions (e.g. that a Value is actually an Address), and pack/unpack
values. Consider a quantifier like forall a: address: P(a) in Move. Before
monomorphization, we have to represent this in Boogie as forall a: Value:
is#Address(a) => P(v#Address(a)). This quantifier is triggered wherever
is#Address(a) is present, independent of the structure of P. Over-triggering or inadequate
triggering of quantifiers is one of the suspected sources of timeouts, as also
discussed in [12].

Moreover, before monomorphization, global memory was indexed in Boogie by
an address and a type instantiation. That is, for struct R<T> we would have one 
Boogie array [Type, int]Value. With monomorphization, the type index is elim- 
inated, as we create different memory variables for each type instantiation. Quan- 
tification over memory content works now on a one-dimensional instead of an n- 
dimensional Boogie array. 


Discussion and Related Work Many approaches have been applied to the verifica- 
tion of smart contracts; see e.g. the surveys [14,29]. [29] refers to at least two dozen 
systems for smart contract verification. It distinguishes between contract and pro- 
gram level approaches. Our approach has aspects of both: we address program level 
properties via pre/post conditions, and contract (“blockchain state”) level properties 
via global invariants. To the best of our knowledge, among the existing approaches, 
the Move ecosystem is the first one where contract programming and specification 
language are fully integrated, and the language is designed from first principles in- 
fluenced by verification. Methodologically, Move and the Move prover are thereby 
closer to systems like Dafny [11], or the older Spec# system [4], where instead of
adding a specification approach a posteriori to an existing language, it is part of the
language from the beginning. This allows us not only to deliver a more consistent user experience, but
also to make verification technically easier by curating the programming language. 

In contrast to other approaches that only focus on specific vulnerability pat- 
terns [6,15,21,31], MVP offers a universal specification language. To the best of 
our knowledge, no existing specification approach for smart contracts based on in- 
ductive Hoare logic has similar expressiveness. We support universal quantification 
over arbitrary memory content, a suspension mechanism of invariants to allow non- 
atomic construction of memory content, and generic invariants. For comparison, 
the SMT Checker built into Solidity [8,9,10] does not support quantifiers, because
it interprets programming language constructs (requires and assert statements) as 
specifications and has no dedicated specification language. While in Solidity one can 
simulate aspects of global invariants using modifiers by attaching pre/post condi- 
tions, this is not the same as our invariants, which are guaranteed to hold indepen- 
dent of whether a user may or (accidentally) may not attach a modifier, and which 
are optimized to be only evaluated as needed. 

While the expressiveness of Move specifications comes at the price of undecidability
and a dependence on heuristics in SMT solvers, MVP deals with this
through its elaborate translation to SMT logic, as described in this paper. The result
is a practical verification system that is fully integrated into the Diem blockchain 
production process, running in continuous integration, which is (to the best of our 
knowledge) a first in the industry. 

The individual techniques we described are each novel in their own right. Reference
elimination relies on borrow semantics, similar to those of the Rust [16] language. We expect
reference elimination to apply to the safe subset of Rust, though some extra
work would be needed to deal with references aggregated by structs. However, we
are not aware that anything similar has been attempted in existing Rust verification
work [1,2,13,30]. Global invariant injection and the approach to minimize
the number of assumptions and assertions is not applied in any existing verification
approach we know of; however, we co-authored a while ago a similar line of work
for runtime checking of invariants in Spec# [28], yet that work never left the conceptual
stage. Monomorphization is well known as a technique for compiling languages
like C++ or Rust, where it is called specialization; however, we are not aware of it 
being generalized for modular verification of generic code where full type erasure 
is not possible, as is the case in Move.


Future Work MVP is conceived as a tool for achieving higher assurance systems, not 
as a bug hunting tool. Having at least temporarily achieved satisfactory performance 
and reliability, we are turning our attention to the question of the goal of higher 
assurance, which raises several issues. If we’re striving for high assurance, it would 
be great to be able to measure progress towards that goal. Since system requirements 
often stem from external business and regulatory needs, lightweight processes for 
exposing those requirements so we know what needs to be formally specified would 
be highly desirable. 

As with many other systems, it is too hard to write high-quality specifications. 
Our current specifications are more verbose than they need to be, and we are work- 
ing to require less detailed specifications, especially for individual functions. We 
could expand the usefulness of MVP for programmers if we could make it possi- 
ble for them to derive value from simple reusable specifications. Finally, software 
tools for assessing the consistency and completeness of formal specifications would 
reduce the risk of missing bugs because of specification errors. 

However, as more complex smart contracts are written and as more people write 
specifications, we expect that the inherent computational difficulty of solving logic 
problems will reappear, and there will be more opportunities for improving perfor- 
mance and reliability. In addition to translation techniques, it will be necessary to 
identify opportunities to improve SMT solvers for the particular kinds of problems 
we generate. 


5 Conclusion 


We described key aspects of the Move prover (MVP), a tool for formal verification 
of smart contracts written in the Move language. MVP has been successfully used 
to verify large parts of the Diem framework, and is used in continuous integration 
in production. The specification language is expressive, specifically through the powerful
concept of global invariants. We described key implementation techniques which 
(as confirmed by our benchmarks) contributed to the scalability of MVP. One of the 
main areas of our future research is to improve specification productivity and reduce 
the effort of reading and writing specs, as well as to continue to improve speed and 
predictability of verification.


Abstract. Superoptimization is a compilation technique that searches
for the optimal sequence of instructions semantically equivalent to a given 
(loop-free) initial sequence. With the advent of SMT solvers, it has been 
successfully applied to LLVM code (to reduce the number of instructions) 
and to Ethereum EVM bytecode (to reduce its gas consumption). Both 
applications, when proven practical, have left out memory operations and 
thus missed important optimization opportunities. A main challenge to 
superoptimization today is handling memory operations while remaining 
scalable. We present GASOLv2, a gas and bytes-size superoptimization
tool for Ethereum smart contracts, that leverages a previous Max-SMT
approach for stack-only optimization to also optimize wrt. memory and
storage. GASOLv2 can be used to optimize the size in bytes, aligned with
the optimization criterion used by the Solidity compiler solc, and it can 
also be used to optimize gas consumption. Our experiments on 12,378 
blocks from 30 randomly selected real contracts achieve gains of 16.42% in 
gas wrt. the previous version of the optimizer without memory handling, 
and gains of 3.28% in bytes-size over code already optimized by solc. 


1 Introduction and Related Work 


Superoptimization is an automated technique for code optimization that was 
proposed back in 1987 [20]. It aims at automatically finding the optimal (wrt.
the considered optimization criteria) instruction sequence that is semantically
equivalent to a given sequence of loop-free instructions. It differs from
traditional optimization techniques in that it uses search rather than applying
pre-cooked transformations. However, as it requires exhaustive search in the
space of valid instruction sequences, it suffers from high computation demands
and was considered impractical for many years. The first attempts at applying
superoptimization were within the GNU C compiler back in the nineties [15] and,
later, it was also applied to x86-64 assembly language [10, 11].

There is a recent revival of superoptimization due to the availability of 
SMT solvers which offer powerful techniques to handle enumerative search and 
to check semantic equivalence. The approaches to superoptimization based on
SMT can be roughly classified into two types: (1) Those that use an external
synthesis algorithm with pruning techniques, such as [9, 12, 17], and that invoke
the SMT solver to solve certain queries. This is the approach of the Souper 
superoptimizer [22] that relies on the synthesis algorithm for loop-free programs 
of Gulwani et al. [17]; (2) Those that directly produce an SMT encoding of the 
problem and use the search engine of the solver. This is the approach of [18], 
EBSO [21] and SYRUP [7]. Both types of approaches have been proven to be
practical in their own settings and optimization criteria: in SYRUP, the analysis
of blocks does not reach the timeout of 10 sec in 90% of the cases [7], and
Souper optimized three million lines of C++ in 88 minutes [22]. The optimizations
achieved vary with the considered criteria: Souper reported around 4.4% reduction
in number of instructions, and SYRUP reported 0.58% in the global Ethereum
gas usage. Scalability has been partly achieved because challenging features
have been left out of the encoding: memory operations have been excluded both 
in Souper and SYRUP. While EBSO included a basic encoding for memory 
operations, its practicality was not proven: EBSO times out in 82% of the blocks 
and achieves optimization in less than 1% of all analyzed blocks. Leaving out 
memory operations dismisses optimization opportunities of two kinds: (a) as it 
works on smaller blocks of instructions (since the optimizer stops when finding 
a memory operation), the stack optimization is more limited, and (b) besides 
it misses possible optimizations on the memory operations themselves (e.g., 
eliminating unnecessary accesses). 


The Ethereum Virtual Machine (EVM) has two areas where it can store items
(besides the stack): (1) the storage, where all contract state variables reside; every
contract has its own storage, which is persistent between external function calls
(transactions) and has a higher gas cost to use; (2) the memory, which holds
temporary values; it is erased between transactions and is thus cheaper to use.
For conciseness, we often use “memory” to include both storage and memory, as 
their treatment for optimization is identical except for their associated costs. Our 
big challenge is to be able to handle memory operations while remaining practical, 
i.e., not reaching the timeout in the optimization of the vast majority of the 
blocks. This is achieved by leveraging SYRUP’s two-staged method [7] to handle 
memory: (i) the first stage is devoted to synthesize a stack specification from the 
bytecode and apply simplification rules to it, and (ii) in a second stage a Max- 
SMT solver is used to perform the search for the optimal solution. When lifting 
such two-staged method to handle memory operations, we make two important 
extensions: in stage (i), we now synthesize a stack and memory specification 
from the bytecode on which we detect dependencies among memory operations 
and possibly remove redundant operations; (ii) this dependency information is 
included in our second stage as part of the encoding so that the SMT solver 
only needs to consider the dependence among such memory instructions when 
performing the search. Our two-staged approach allows isolating the dependency 
analysis process from the search itself, reducing the effort the SMT solver does 
in order to find the optimal sequence. The approach of Bansal and Aiken [10] 
to handle memory operations differs from ours in the superoptimization scope
and the search process itself. Their tool considers multiple target sequences from
a training set simultaneously and generates a database of (possibly) millions of
optimizations. They enumerate all well-formed instruction sequences up to a
certain size, including memory operations, and test the equivalence among them 
via a hash function. Our tool considers each sequence of instructions to optimize 
independently and the search is done via the search engine of an SMT solver. 

GASOLv2 can be considered a successor of SYRUP [7], as it adopts its two-staged
process and reuses part of its components, but it incorporates three
fundamental extensions, and a new experimental evaluation, that constitute
the main contributions of this paper: (1) GASOLv2 starts from the assembly
json [1] generated by the solc compiler, rather than being used as a standalone
optimization tool like SYRUP. This is fundamental to achieve a wide use of the tool
since it is already linked to one of the most used compilers for coding Ethereum 
smart contracts. (2) It optimizes memory and storage operations using on one 
hand rule simplifications at the level of a specification synthesized from the 
bytecode, and on the other hand, a new SMT encoding which enables achieving 
a great balance between the accuracy and the overhead of the process. (3) While 
SYRUP is a tool that only optimizes the gas consumption of the bytecode, we 
have generalized some of its components to enable other optimization criteria. 
Currently we have also included size in bytes, but other criteria can now be
easily incorporated into the superoptimizer. (4) Besides, we have performed a
thorough experimental evaluation of our tool and have compared the results wrt.
those obtained by SYRUP. The main conclusion of our evaluation is that handling
memory operations in superoptimization pays off: it can achieve gains of 16.42%
in gas over SYRUP, and reductions of 0.1% in gas and 3.28% in size (on already
optimized code). If we assume that these savings are uniformly distributed, and
the gas data obtained from Etherscan is constant, the 0.1% gas saved wrt.
SYRUP [7] would amount to nearly 9.5 million dollars in 2021.

GASOLv2 is part of the GASOL project [3], a GAS Optimization tooLkit for
Ethereum smart contracts. The initial GASOL tool (i.e., GASOLv1), presented
in [5], aimed at detecting gas-expensive patterns within program loops (using 
resource analysis) and made a program transformation (which does not rely on 
SMT solvers) at the source code level. Hence, it contains a global (inter-block) 
optimization technique that is orthogonal to our superoptimizer, in which we 
perform local (or intra-block) transformations on loop-free code, and besides we 
work at bytecode rather than at source level. Both complementary techniques 
will be integrated within the GASOL toolkit, hence their names. In what follows, 
we drop v2 and use GASOL to refer to the tool presented in this paper. 


2 The Architecture of GASOL 


Figure 1 displays the architecture of GASOL: white components are borrowed
from other tools, while gray components correspond to the new developments of 
this paper (either completely new, like DEP, or novel extensions for memory 
handling of previous SYRUP’s implementations, like SPEC, SIMP and SMS). 
The input to GASOL is a smart contract (either its source in Solidity or its
compiled EVM bytecode [23]), a selection of the optimization criteria (currently 
we are supporting gas consumption and size in bytes), and system settings (this 
includes compiler options for invoking the solc compiler and GASOL settings like 
the timeout per block of instructions). The output of GASOL is an optimized 
bytecode program and optionally a report with detailed information on the 
optimizations achieved (e.g., number of blocks optimized, number of blocks 
proven optimal, gas/size reduction gains, optimization time, among others). 

The first component, labeled SOLC in the figure, invokes the Solidity compiler 
solc to obtain the bytecode in its assembly json exchange format [1]. Working on
this exchange format has many advantages; one is that we can enable the optimizer
of solc [4] and start the superoptimization from an already optimized bytecode. 
Besides, the format has been designed to be a usable common denominator 
for EVM 1.0, EVM 1.5 and Ewasm. Hence, we argue it is a good source for 
superoptimization as different target platforms will be able to use our tool 
equally. The assembly json format provides the EVM bytecode of the smart 
contract, metadata that relates it with the source Solidity code, and compilation 
information such as the version used to generate the bytecode. The output yielded
by GASOL can also be returned in assembly json format so that it can be used
by other tools working on this format in the future. From the assembly json, 
the next component BLK partitions the bytecode given by solc into a set of 
sequences of loop-free bytecode instructions, named blocks, which correspond to 
the blocks of the CFG and also computes the size of the stack when entering each 
block.³ We omit details of this step as it is standard in compiler construction
and, for the case of the EVM, has already been the subject of other analysis and
optimization papers (see, e.g., [8, 14, 16] and their references).

The next component SPEC synthesizes a functional specification of the 
operand stack and of the memory and storage (SMS for short) for each block of 
bytecode instructions. This is done by symbolically executing the bytecodes in 
the block to extract from them what the contents of the operand stack and of 
the memory/storage are after executing them. The description of this component 
is given in Sec. 3.1. Next, DEP establishes the dependencies among the memory 
accesses from which a pre-order, that determines when a memory access needs 


³ In EVM, it is possible to reach a block with different stack sizes, and all such sizes
can be statically computed. We will refer to the minimum or maximum when needed. 
(1) τ(MLOAD, < S,M,St >)  := < [MLOAD(S[0])] + S[1 : n], M + [MLOAD(S[0])], St >
(2) τ(MSTORE, < S,M,St >) := < S[2 : n], M + [MSTORE(S[0], S[1])], St >
(3) τ(SLOAD, < S,M,St >)  := < [SLOAD(S[0])] + S[1 : n], M, St + [SLOAD(S[0])] >
(4) τ(SSTORE, < S,M,St >) := < S[2 : n], M, St + [SSTORE(S[0], S[1])] >
(5) τ(SWAPX, < S,M,St >)  := let temp = S[0] in < S[0/X][X/temp], M, St >
(6) τ(POP, < S,M,St >)    := < S[1 : n], M, St >


Fig. 2: SMS Synthesis by Symbolic execution 


to be performed before another one, is generated. For instance, subsequent load 
accesses, which are not interleaved by any store, do not have dependencies among
them, while they do depend on subsequent write accesses to the same positions.
This phase is described in Secs. 3.2 (dependencies) and 3.3 (pre-order). In the 
next component SIMP, we apply simplification rules on the SMS. We include 
all stack simplification rules of SYRUP [7], as well as the additional rules we 
have developed for memory/storage simplifications. For instance, successive write
accesses that overwrite the same memory position are simplified to a single one 
provided the same memory location is not read by any other instruction between 
them. The description of this component is given in Sec. 3.2. Finally, we generate 
a Max-SMT encoding from the (simplified) SMS that incorporates the pre-order 
established by the component DEP and from which the optimized bytecode is 
obtained. The description of this component is given in Sec. 4. 


3 Synthesis of Stack and Memory Specifications 


This section describes the first stage of the optimization (components SPEC, 
SIMP and DEP) that consists in synthesizing from a loop-free sequence of bytecode
instructions a simplified specification of the stack and of the memory/storage
(with the dependencies) that the execution of such bytecodes produces. 


3.1 Initial Stack and Memory/Storage Specification 


For each block, we synthesize its Stack and Memory Specification (SMS) by 
symbolically executing the instructions in the sequence. Function τ in Fig. 2
defines the symbolic execution for the memory/storage operations (1-4) and
includes two representative stack opcodes (5-6). The first parameter of τ is a
bytecode instruction and the second one is the SMS data structure < S,M,St >
whose first element corresponds to the stack (S), the second one to the memory
(M), and the third one to the storage (St). The stack S is a list whose position
S[0] corresponds to the top of the stack. At the beginning of executing a block,
the stack contains the minimum number of elements needed to execute the block
represented by symbolic variables s_i, where s_i models the element at S[i]. The
resulting list M (St resp.) will contain the sequence of memory (storage resp.) 
accesses executed by the block. By abuse of notation, we often treat lists as 
sequences. Both M and St are empty before executing the block symbolically. 
As an example, the symbolic execution of SSTORE removes the two top-most 
elements from S, and adds the symbolic expression SSTORE(S[0],S[1]) to the 
storage sequence. Similarly, SLOAD removes from the top of the stack the position 
to be read, puts on the top of the stack the symbolic expression SLOAD(S[0]) that 
represents the value read from the storage position S[0], and adds the same
expression to the storage sequence St. As a result of applying τ to a sequence of
bytecodes, the SMS obtained provides a specification of the target stack after 
executing the sequence in terms of the elements located in the stack before 
executing the sequence and, the target memory/storage (given as a sequence of 
accesses) after executing the sequence in terms of the input stack elements too. 
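
The symbolic execution of Fig. 2 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: terms are nested tuples, the names `tau` and `synthesize` are hypothetical, and a PUSH case (not shown in Fig. 2) is added so that the first instructions of a block can be executed.

```python
# Illustrative sketch of the symbolic execution function tau of Fig. 2.
# Terms are nested tuples such as ("SLOAD", 1); this representation is an
# assumption of the sketch, not the data structure used by the tool.

def tau(instr, sms):
    """Apply one bytecode instruction to an SMS < S, M, St >."""
    stack, mem, sto = sms
    op, *args = instr.split()
    if op.startswith("PUSH"):              # extra case: hexadecimal constant
        return [int(args[0], 16)] + stack, mem, sto
    if op == "POP":
        return stack[1:], mem, sto
    if op.startswith("SWAP"):
        x = int(op[4:])
        swapped = list(stack)
        swapped[0], swapped[x] = swapped[x], swapped[0]
        return swapped, mem, sto
    if op == "MLOAD":
        term = ("MLOAD", stack[0])
        return [term] + stack[1:], mem + [term], sto
    if op == "MSTORE":
        return stack[2:], mem + [("MSTORE", stack[0], stack[1])], sto
    if op == "SLOAD":
        term = ("SLOAD", stack[0])
        return [term] + stack[1:], mem, sto + [term]
    if op == "SSTORE":
        return stack[2:], mem, sto + [("SSTORE", stack[0], stack[1])]
    raise ValueError(f"opcode not covered by this sketch: {op}")


def synthesize(block, initial_stack=()):
    """Fold tau over a loop-free block, starting with empty M and St."""
    sms = (list(initial_stack), [], [])
    for instr in block:
        sms = tau(instr, sms)
    return sms


# Stack becomes empty and M records MSTORE(64, 128).
sms = synthesize(["PUSH1 80", "PUSH1 40", "MSTORE"])
```

The initial stack is passed as a list of symbolic names (e.g., `["s0", "s1"]`), mirroring the symbolic variables s_i used in the text.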


Example 1. Consider the following bytecode that belongs to a real contract 
(bytecodes 0 to 47 of Welfare [2]). Its assembly json yielded by the SOLC component
contains 4524 bytecodes and after being partitioned by BLK we have 437 blocks 
to optimize. We illustrate the superoptimization of this block that contains in 
total 48 bytecodes from which 5 are the (underlined) memory/storage accesses: 


 1 PUSH1 80      9 DUP2         17 DUP4         25 PUSH2 3E8    33 PUSH2 FFFF   41 MUL
 2 PUSH1 40     10 SLOAD        18 PUSH2 FFFF   26 PUSH1 1      34 MUL          42 OR
 3 MSTORE       11 DUP2         19 AND          27 PUSH1 16     35 NOT          43 SWAP1
 4 PUSH1 64     12 PUSH2 FFFF   20 MUL          28 PUSH2 100    36 AND          44 SSTORE
 5 PUSH1 1      13 MUL          21 OR           29 EXP          37 SWAP1        45 POP
 6 PUSH1 14     14 NOT          22 SWAP1        30 DUP2         38 DUP4         46 CALLVALUE
 7 PUSH2 100    15 AND          23 SSTORE       31 SLOAD        39 PUSH2 FFFF   47 DUP1
 8 EXP          16 SWAP1        24 POP          32 DUP2         40 AND          48 ISZERO


As BLK returns that the stack is empty when entering the block, we apply τ to
the initial state < [],[],[] > and produce the following SMS at the next selected
lines:

L1: τ(PUSH1 80, < [],[],[] >) = < [128],[],[] >
L2: τ(PUSH1 40, < [128],[],[] >) = < [64, 128],[],[] >
L3: τ(MSTORE, < [64, 128],[],[] >) = < [],[MSTORE(64,128)],[] >


Finally, we get that at L48 S = [ISZERO(CALLVALUE), CALLVALUE], M = [MSTORE(64,128)],
St = [SLOAD_1(1), SSTORE(1,V1), SLOAD_2(1), SSTORE(1,V2)] where V1 = OR(MUL(...),
AND(NOT(...), SLOAD_1(1))) (omitting subexpressions) and V2 is another similar expression
involving arithmetic, binary operations and SLOAD_2(1). Note that we use subscripts
to distinguish the SLOAD instructions by their position in St. The stack specification
contains a term that represents the result of the opcode CALLVALUE (executed at line 46,
L46 for short), and a term with the result of executing the opcode ISZERO on CALLVALUE,
stored on top of the stack. The memory only contains one element that is obtained by
symbolically executing the three first instructions. The PUSH instructions at L1 and L2
introduce the values 128 and 64 on the stack, and the MSTORE executed at L3 introduces
in M the symbolic expression MSTORE(64,128). Similarly, St contains the sequence of
symbolic expressions that represent the storage instructions executed in the block at
L10, L23, L31 and L44 respectively. The expressions corresponding to V1 and V2 are also
obtained by applying function τ to the corresponding state. These stack expressions
can be simplified in the next step using the rules in [7]. 


We note that the EVM memory is byte addressable (e.g., with instruction MSTORE8)
and two different memory accesses may overlap. For simplicity of the presentation, 
we only consider the general case of word-addressable accesses, but the technique 
extends easily to the byte addressable case. In what follows, we use LOAD to 
abstract from the specific memory (MLOAD) and storage (SLOAD) bytecodes (and 
the same for STORE), when they are treated in the same way. 


3.2 Memory/Storage Simplifications 

In order to define the simplifications, and to later indicate to the SMT solver 
which memory instructions need to follow an order, we compute the conflicts 
between the different load and store instructions within each sequence. 
Definition 1. Two memory accesses A and B conflict, denoted as conf(A,B), if:

(i) A is a store and B is a load and the positions they access might be the same; 

(ii) A and B are both stores, the positions they modify might be the same, and 
they store different values. 


Note that in (ii) two store instructions that might operate on the same position 
do not conflict if the values they store are equal, as we will reach the same 
memory state regardless of the order in which the stores are executed. Note that 
two load instructions are never in conflict as the memory state does not change 
if we execute them in one order or another. 
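
A minimal sketch of such a conflict test is given below. It rests on assumptions that are ours and not part of Def. 1: accesses are tuples `("LOAD", position)` or `("STORE", position, value)`, two positions are known to differ only when they are distinct integer constants, two values are considered equal only when they are syntactically equal, and the test is applied regardless of the order of its arguments.

```python
# Illustrative conflict test in the spirit of Def. 1 (names are hypothetical).

def is_load(access):
    return access[0] == "LOAD"


def is_store(access):
    return access[0] == "STORE"


def may_alias(p, q):
    """Conservative check: only distinct integer constants surely differ."""
    if isinstance(p, int) and isinstance(q, int):
        return p == q
    return True


def conf(a, b):
    """Return True if the accesses a and b conflict."""
    if is_load(a) and is_load(b):
        return False                  # two loads never conflict
    if not may_alias(a[1], b[1]):
        return False                  # positions are surely different
    if is_store(a) and is_store(b):
        return a[2] != b[2]           # equal stored values do not conflict
    return True                       # a store and a load that may alias
```

A more accurate `may_alias` (e.g., one that reasons about symbolic offsets) would report fewer conflicts; the rest of the test would stay unchanged.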

Given the SMS obtained in Sec. 3.1, we achieve simplifications by applying 
the stack simplification rules of [7] and, besides, the following new memory 
simplification rules based on Def. 1 to the M and St components (that achieve
optimizations of type (b) according to the classification mentioned in Sec. 1): 


Definition 2 (memory simplifications). Let < S,M,St > be an SMS. We
can apply the following simplifications to any subsequence b_1,...,b_n in M or St:

i) if b_1 = STORE(p,v) and b_n = LOAD(p) and ∄ b_i = STORE with i ∈ {2,...,n-1} and
conf(b_1,b_i), we simplify it to b_1,...,b_{n-1} and replace b_n by v in the resulting
SMS.

ii) if b_1 = STORE(p,v) and b_n = STORE(p,w) and ∄ b_i = LOAD with i ∈ {2,...,n-1} and
conf(b_1,b_i), we simplify it to b_2,...,b_n.

iii) if b_1 = LOAD(p) and b_n = STORE(p,LOAD(p)) and ∄ b_i = STORE with i ∈ {2,...,n-1}
and conf(b_1,b_i), we simplify it to b_1,...,b_{n-1}.


The simplifications can be applied in any order within M and St until the process 
converges and the resulting sequence cannot be further simplified. 


Intuitively, in (i), a load instruction from a position after a store instruction to 
the same position is simplified in the stack to the stored value provided there 
is no other store operation in between that might have changed the content 
of this position. In (ii), two subsequent store instructions to the same position 
are simplified to a single store if there is no load access on the same position 
between them. In (iii), a store instruction that stores in a position the result of 
the load in the same position can be removed, provided there is no other store in 
between that might have changed the content of this position. Note that such 
simplification rules can be applied to general-purpose compilers. 
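
The three rules can be prototyped as a fixpoint over one access sequence. The sketch below is a simplified illustration and not the implementation of the tool: loads carry an extra tag to tell apart loads of the same position (`("LOAD", position, tag)`), the conflict test is the conservative one assumed in the comments, and replacing the removed load by the stored value is only propagated inside the sequence (a full version would also rewrite the stack S).

```python
# Illustrative fixpoint application of rules i)-iii); all names are hypothetical.

def is_load(a):
    return a[0] == "LOAD"

def is_store(a):
    return a[0] == "STORE"

def may_alias(p, q):
    # Assumption: only two distinct integer constants are surely different.
    return p == q if isinstance(p, int) and isinstance(q, int) else True

def conf(a, b):
    if is_load(a) and is_load(b):
        return False
    if not may_alias(a[1], b[1]):
        return False
    if is_store(a) and is_store(b):
        return a[2] != b[2]
    return True

def subst(term, old, new):
    """Replace every occurrence of the term old by new inside term."""
    if term == old:
        return new
    if isinstance(term, tuple):
        return tuple(subst(t, old, new) for t in term)
    return term

def simplify(seq):
    seq = list(seq)
    changed = True
    while changed:
        changed = False
        for i in range(len(seq)):
            for j in range(i + 1, len(seq)):
                b1, bn, between = seq[i], seq[j], seq[i + 1:j]
                if b1[1] != bn[1]:
                    continue            # rules need the same position p
                no_store = not any(is_store(b) and conf(b1, b) for b in between)
                no_load = not any(is_load(b) and conf(b1, b) for b in between)
                if is_store(b1) and is_load(bn) and no_store:        # rule i)
                    seq = [subst(b, bn, b1[2])
                           for k, b in enumerate(seq) if k != j]
                elif is_store(b1) and is_store(bn) and no_load:      # rule ii)
                    del seq[i]
                elif (is_load(b1) and is_store(bn)
                      and bn[2] == b1 and no_store):                 # rule iii)
                    del seq[j]
                else:
                    continue
                changed = True
                break
            if changed:
                break
    return seq

# Storage sequence with the shape of Ex. 1 (V1 and V2 abbreviated).
L1, L2 = ("LOAD", 1, 1), ("LOAD", 1, 2)
V1, V2 = ("OR", "a", L1), ("OR", "b", L2)
result = simplify([L1, ("STORE", 1, V1), L2, ("STORE", 1, V2)])
```

On this input the second load is replaced by V1 (rule i) and the first store is then removed (rule ii), leaving the first load followed by a single store.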


Example 2. In the SMS of Ex. 1, we have that conf(SLOAD_1(1),SSTORE(1,V1)),
conf(SLOAD_1(1),SSTORE(1,V2)), conf(SLOAD_2(1),SSTORE(1,V1)), conf(SLOAD_2(1),SSTORE(1,V2))
and conf(SSTORE(1,V1),SSTORE(1,V2)) as all accesses operate on the same location.
With these conflicts, we can apply rule i) to SLOAD_2(1), as the previous
SSTORE instruction has stored the value V1 at the same location and there are no
other storage instructions with conflict between them. Hence, we eliminate it
from St and replace it by V1 in the resulting SMS. After that, we are able to apply
rule ii) on the two SSTORE instructions as they store a value at the same position
without conflicting loads in between. Then, we remove SSTORE(1,V1) from St. The
resulting SMS has the same S and M, and St is now [SLOAD_1(1), SSTORE(1,V2')],
where V2' is V2 replacing SLOAD_2(1) by V1.
3.3 Pre-Order for Memory and Uninterpreted Functions


Given the SMS and using the conflict definition above, we generate a pre-order, 
as defined below, that indicates to the SMT solver the order between the memory 
accesses that needs to be kept in order to obtain the same memory state as the 
original one. Clearly, having more accurate conflict tests will result in weaker 
pre-orders and hence a wider search space for the SMT solver. This in turn will 
result in potentially larger optimization. Our implementation is highly parametric 
on the conflict test DEP so that more accurate tests can be easily incorporated. 


Definition 3. Let A and B be two memory accesses in a sequence S. We say
that B has to be executed after A in S, denoted as A ⊏ B, if:


i) (store-store) B is a store instruction and A is the closest store instruction 
predecessor of B in S such that conf(A,B). 
ii) (load-store) A is a load instruction and B is the closest store instruction 
successor of A in S such that conf(B,A). 
iii) (store-load) B is a load instruction and A is the closest store instruction 
predecessor of B in S such that conf(A,B). 


Let us observe that we do not compute the closure for the dependencies at this 
stage, as the SMT solver will infer them, as explained in Sec. 4.2. 


Example 3. From the simplified SMS of Ex. 2, we get the following load-store
dependency, SLOAD_1(1) ⊏ SSTORE(1,V2'), while the access MSTORE(64,128) has no
dependencies as it is the unique memory operation.
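
The generation of the pre-order of Def. 3 can be sketched as follows. As before, this is an illustration under our own assumptions: accesses are tuples `("LOAD", position, tag)` or `("STORE", position, value)`, the conflict test is a conservative stand-in for DEP, and dependencies are returned as pairs of indices into the sequence.

```python
# Illustrative computation of the memory pre-order (names are hypothetical).

def is_load(a):
    return a[0] == "LOAD"

def is_store(a):
    return a[0] == "STORE"

def may_alias(p, q):
    # Assumption: only two distinct integer constants are surely different.
    if isinstance(p, int) and isinstance(q, int):
        return p == q
    return True

def conf(a, b):
    if is_load(a) and is_load(b):
        return False
    if not may_alias(a[1], b[1]):
        return False
    if is_store(a) and is_store(b):
        return a[2] != b[2]
    return True

def preorder(seq):
    """Return pairs (i, j): seq[i] has to be executed before seq[j]."""
    deps = set()
    for j, access in enumerate(seq):
        # Rules i) and iii): closest conflicting store preceding the access.
        for i in range(j - 1, -1, -1):
            if is_store(seq[i]) and conf(seq[i], access):
                deps.add((i, j))
                break
        # Rule ii): closest conflicting store following a load.
        if is_load(access):
            for k in range(j + 1, len(seq)):
                if is_store(seq[k]) and conf(seq[k], access):
                    deps.add((j, k))
                    break
    return deps

# Sequences of Ex. 3: one load-store dependency in St, none in M.
storage_deps = preorder([("LOAD", 1, 1), ("STORE", 1, "V2'")])
memory_deps = preorder([("STORE", 64, 128)])
```

No transitive closure is computed here, in line with the observation that the SMT solver infers it.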


Importantly, the notion of pre-order between memory instructions can also be 
naturally extended to all other operations that occur in the specification of the 
target stack. These operations are handled as uninterpreted functions and have to 
be called in the right order to build the result that is required in the target stack. 
Therefore, we propose a novel implementation (both in SYRUP and GASOL) that
extends the pre-order ⊏ to uninterpreted functions by adding A ⊏ B also when:


iv) (uninterpreted-functions) A and B are uninterpreted functions that occur in 
the target stack as B(...,A(...),...). 


While in the case of uninterpreted functions the pre-order is used for improving 
performance, for memory operations the use of the pre-order is mandatory for 
soundness, since it is what ensures that the obtained block after optimization has 
the same final state (in the stack, memory and storage) as the original block.


3.4 Bounding the Operations Position 

As we will show in the next section, a solution to our SMT encoding assigns a 
position in the final instruction list to each operation such that the target stack 
is obtained. A key element for the performance of the encoding we propose in
this paper is extracting, from the instruction pre-order ⊏, upper and
lower bounds on the position the operations can take in the instruction list. The
lower bound for a given function is obtained by inspecting the subterm where it 
occurs in the target stack and analyzing its operands to detect the earliest point 
in which the results of all of them can be placed in the stack, taking into account
that shared subcomputations can be obtained using a DUP opcode. On the other
hand, the upper bound for a function is obtained by inspecting the position in
the target stack where it occurs and analyzing the operations that use the term that
is headed by this function, to obtain the latest point in which this term could be
computed. From this analysis, we obtain both the upper UB(ι) and lower LB(ι)
bounds for every uninterpreted (which includes the load) and store operation ι,
which are extensively used in the encoding provided in the next section.


4 Max-SMT Superoptimization 


This section describes the second stage of the optimization process (named Max 
SMT in Fig. 1) that consists in producing, from the SMS and the dependencies, 
a Max-SMT encoding such that any valid model corresponds to a bytecode 
equivalent to the initial one and optimized for the selected criterion. 


4.1 Stack Representation in the SMT Encoding 


The stack representation is the same as in [7]: the stack can hold non-negative
integer constants in the range {0,...,2^256 - 1}, matching the 256-bit words in the
EVM; initial stack variables s_0,...,s_{k-1} represent the initial (unknown) elements
of the stack; and fresh variables s_k,...,s_v abstract each different subterm (built
from opcodes and the initial stack variables) that appears in the SMS. A stack
variable of the form s_i is represented in the encoding as the integer constant
2^256 + i, so that all stack elements in the model are integer values. To represent
the contents of the stack after applying a sequence of instructions, a bound on
the number of operations b_o and on the size of the stack b_s must be given. These
numbers are statically computed by considering the size of the initial block and
the maximum number of stack elements involved. Then, propositional variables
u_{i,j}, with i ∈ {0,...,b_s - 1} and j ∈ {0,...,b_o}, are used to denote whether there
exists an element at position i in the stack after executing the first j operations,
where the element u_{0,j} refers to the topmost element of the stack. Quantified
variables x_{i,j} ∈ ℤ are introduced to identify the word at position i after applying
j operations, following the same format as u_{i,j}.
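
The integer representation of stack elements can be illustrated with a small sketch. The helper names `encode_var` and `decode` are hypothetical; the only fact taken from the text is that a stack variable s_i is mapped to 2^256 + i, so that it can never collide with a 256-bit constant.

```python
# Illustrative encoding of stack elements as integers (helper names are ours).

WORD_LIMIT = 2 ** 256   # EVM constants lie in {0, ..., 2^256 - 1}


def encode_var(i):
    """Integer that represents the stack variable s_i in the model."""
    return WORD_LIMIT + i


def decode(value):
    """Classify an integer of the model as a constant or a stack variable."""
    if value < 0:
        raise ValueError("stack elements are non-negative integers")
    if value < WORD_LIMIT:
        return ("const", value)
    return ("var", value - WORD_LIMIT)
```

Since Python integers have arbitrary precision, values above 2^256 need no special treatment in this sketch.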

An instruction ι ∈ ℐ in the encoding can be either a basic stack opcode
(POP, SWAPk, ...), a distinct expression that appears in the SMS or the extra
instruction NOP that represents the possibility that no opcode has been applied. A
mapping θ is introduced to link every instruction in ℐ to a non-negative integer
in {0,...,m_ι}, where m_ι + 1 = |ℐ|. This way, we can introduce the existentially
quantified variables t_j, with t_j ∈ {0,...,m_ι} and j ∈ {0,...,b_o - 1}, to denote
that the instruction ι is applied at step j when t_j = θ(ι). There is a special case
to be considered when identifying the instructions from an SMS: each expression
containing a single occurrence of an opcode in W_base (see [23]) is considered as an
independent expression with a different ι. Opcodes in W_base consume no operand
from the top of the stack and have lower gas cost and equal byte count as DUPk, so we
can safely assume that in an optimal block such expressions are never duplicated.
For efficiency reasons, we also apply the reciprocal: any other expression is forced
to appear exactly once in our solution, as our experiments show that duplicating
the expression is always better than computing it more than once. However, note
that this may not hold, in general, when the cost of the expression is low or the
size of the operating stack is high, and hence, although it is highly unlikely, we may
lose some better solutions. From this assumption, we have that every ι we have
introduced must appear exactly once in every model, which simplifies greatly
both the pre-order encoding and the gas model used. The following example
illustrates how the SMS is processed and the relevance of considering W_base:


Example 4. Consider a modified version of Ex. 1, in which S = [ISZERO(CALLVALUE),
CALLVALUE] but M, St are both empty. b_o, b_s are bounded to 3 and 2 resp., as
three instructions are enough to compute the given SMS and it reaches a stack
size of two elements. Each application of CALLVALUE is considered independently,
as CALLVALUE ∈ W_base. Variables s_0 := 2^256, s_1 := 2^256 + 1, s_2 := 2^256 + 2 are
introduced to represent the stack variables obtained from CALLVALUE_0, CALLVALUE_1
and ISZERO(CALLVALUE_1). GASOL creates the following θ map:
θ := {PUSH : 0, POP : 1, NOP : 2, DUP1 : 3, SWAP1 : 4,
CALLVALUE_0 : 5, CALLVALUE_1 : 6, ISZERO(CALLVALUE_1) : 7}


The optimal sequence is CALLVALUE CALLVALUE ISZERO, which consumes 7 units of 
gas. It improves the cost of L46-L48, which consumes 8 due to the use of DUP1. 
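
The gas figures of this example can be checked with a short script. It is only a sketch: the gas table lists just the three opcodes involved (2 units for CALLVALUE, 3 for DUP1 and ISZERO, following the fee schedule of [23]), and the tiny symbolic interpreter `run` is a hypothetical helper that confirms both sequences build the same target stack.

```python
# Illustrative check of Ex. 4: same target stack, different gas cost.

GAS = {"CALLVALUE": 2, "DUP1": 3, "ISZERO": 3}


def gas(block):
    """Total gas of a block made of the opcodes listed in GAS."""
    return sum(GAS[op] for op in block)


def run(block):
    """Symbolically execute the block; the top of the stack is at index 0."""
    stack = []
    for op in block:
        if op == "CALLVALUE":
            stack.insert(0, "CALLVALUE")
        elif op == "DUP1":
            stack.insert(0, stack[0])
        elif op == "ISZERO":
            stack[0] = ("ISZERO", stack[0])
        else:
            raise ValueError(f"opcode not covered by this sketch: {op}")
    return stack


original = ["CALLVALUE", "DUP1", "ISZERO"]        # L46-L48 of Ex. 1
optimized = ["CALLVALUE", "CALLVALUE", "ISZERO"]  # optimal sequence of Ex. 4
```

Both blocks leave ISZERO(CALLVALUE) on top of CALLVALUE, while the optimized one replaces the DUP1 by a cheaper second CALLVALUE.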


The set of instructions ℐ can be split in four subsets ℐ_S ⊎ ℐ_U ⊎ ℐ_C ⊎ ℐ_St:


- ℐ_S contains the basic stack operations: PUSH, POP, NOP, DUPk, and SWAPk, with
k ∈ {1,...,min(b_s - 1, 16)}. DUPk and SWAPk are restricted by b_s because
they cannot deal with elements that go beyond the maximum stack size.

- ℐ_U contains the non-commutative uninterpreted functions that appear in the
SMS. Its subset ℐ_L ⊆ ℐ_U denotes the set of load instructions.

- ℐ_C contains the commutative uninterpreted functions in the SMS.

- ℐ_St contains the write operations in memory structures.


The encoding for subsets ℐ_S ⊎ ℐ_U ⊎ ℐ_C was already considered in [7], whereas
ℐ_St was left out. Instead, blocks were split when an opcode belonging to ℐ_St was
found. The inclusion of ℐ_St instructions in the model leads to more savings in
gas, as more optimizations can be applied in larger blocks (those correspond to 
optimizations of type (a) in the classification given in Sec. 1). 

For each $\iota \in \mathcal{I}$ and each possible position $j$ in the sequence of instructions, we
add a constraint to represent the impact of this combination on the stack. These
constraints match the semantics of $\tau$ when projecting onto the stack component,
so that we encode the elements of the stack after executing $\iota$ in terms of the
ones before its execution. They follow the structure $t_j = \theta(\iota) \Rightarrow C_\iota(j)$, where
$C_\iota(j)$ expresses the changes in the stack after applying $\iota$. The constraints for
$\mathcal{I}_S \uplus \mathcal{I}_U \uplus \mathcal{I}_C$ are detailed in [7]; our extension in this section only adds
the constraints that reflect the impact of storage operations on the stack. For this
purpose, we use an auxiliary predicate Move (already used in [7]) to denote that
all elements in the stack are moved two positions to the right in the resulting
stack state. Thus, we have the following constraint for each position $j$ and each
$\iota \in \mathcal{I}_{St}$, where $o_0$ and $o_1$ denote the position and value stored:

$$C_{St}(j, \iota) := t_j = \theta(\iota) \Rightarrow u_{0,j} \wedge u_{1,j} \wedge x_{0,j} = o_0 \wedge x_{1,j} = o_1 \wedge \mathit{Move}(j, 2, b_s - 1, -2) \wedge \neg u_{b_s-1,j+1} \wedge \neg u_{b_s-2,j+1}$$
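To make the intended stack effect of a store concrete, the following Python sketch simulates it on an explicit bounded stack. It is an illustration under our reading of the constraint, not GASOL's encoding: the function name `apply_store` and the list representation (index 0 is the top of the stack, unused positions are simply absent) are assumptions made for this example.

```python
def apply_store(stack, o0, o1, bs):
    """Stack effect of a store instruction on a stack bounded by bs.

    `stack` lists the used positions, top first; positions
    len(stack)..bs-1 are unused.
    """
    if len(stack) > bs:
        raise ValueError("stack exceeds the bound bs")
    # The two topmost positions must be in use ...
    if len(stack) < 2:
        raise ValueError("a store needs two operands")
    # ... and must hold the stored position o0 and the stored value o1.
    if stack[0] != o0 or stack[1] != o1:
        raise ValueError("operands do not match o0 and o1")
    # Both operands are consumed; the remaining elements shift by two positions.
    result = stack[2:]
    # The last two positions of the bounded stack are unused afterwards.
    assert len(result) <= bs - 2
    return result

print(apply_store(["key", "value", "a", "b"], "key", "value", bs=4))
```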

Finally, we express the contents of the stack before executing the instructions 
of the block (initial stack) and after having executed them (target stack) by 
assigning the corresponding values (whether constants or stack variables) to
$u_{i,0}, x_{i,0}$ and to $u_{i,b_o}, x_{i,b_o}$, respectively. The overall SMT encoding for the stack
representation is denoted as $C_{SFS}$ and it is encoded using QF_LIA logic.


Example 5. Following Ex. 4, GASOL generates the constraint shown below to
update the contents of the stack after applying $\iota = \mathrm{ISZERO}(\mathrm{CALLVALUE}_1)$ at step 2:

$$C_U(2, \iota) := t_2 = 7 \Rightarrow u_{0,2} \wedge x_{0,2} = 2^{256} + 1 \wedge u_{0,3} \wedge x_{0,3} = 2^{256} + 2 \wedge u_{1,3} = u_{1,2} \wedge x_{1,3} = x_{1,2}$$


4.2 Encoding the Pre-order Relation 


Once the stack representation has been formalized, we also need to consider the
conflicts that appear among memory operations as part of our encoding, as well as
the dependencies between uninterpreted functions. All this is done by encoding
the pre-order relation given in Sec. 3.3. We consider each pair of instructions $\iota, \iota'$
s.t. $\iota \sqsubset \iota'$. We aim to prevent conflicting operations from appearing in the wrong
order in a model, by imposing that $\iota$ cannot occur in the assignment after $\iota'$.
Our proposed approach consists in introducing a variable $l_{\theta(\iota)}$ for every
instruction $\iota \in \mathcal{I}_C \cup \mathcal{I}_U \cup \mathcal{I}_{St} := \mathcal{I}_{lord}$ to track the position at which it appears in a
sequence. This information is useful for specifying multiple conditions in the
encoding that are difficult to reflect otherwise. Firstly, these variables implicitly
enforce that $\iota$ must be tied to exactly one position, and thus included in every
sequence exactly once. Besides, we can narrow the positions in which $\iota$ can appear
by using the bounds $LB(\iota)$ and $UB(\iota)$. Finally, as QF_LIA supports ordering among
variables, the order between conflicting instructions can be encoded as a plain
comparison between their positions. Hence, the following constraints are derived:
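$$L_P(\iota) := LB(\iota) \le l_{\theta(\iota)} \le UB(\iota) \wedge \bigwedge_{LB(\iota) \le j \le UB(\iota)} (l_{\theta(\iota)} = j) \Leftrightarrow (t_j = \theta(\iota))$$

$$L_{lord}(\iota, \iota') := l_{\theta(\iota)} < l_{\theta(\iota')} \quad \text{where } \iota \sqsubset \iota'$$


Regarding memory operations, there is no need to consider special cases. The
whole encoding can be expressed as follows:

$$C_{SMS} := C_{SFS} \wedge \bigwedge_{\iota \in \mathcal{I}_{lord}} L_P(\iota) \wedge \bigwedge_{\iota \sqsubset \iota'} L_{lord}(\iota, \iota')$$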
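The effect of the position variables can be illustrated without an SMT solver. The sketch below enumerates, by brute force, the orders in which every instruction occurs exactly once and every dependency pair respects the comparison between positions. The instruction names and the `precedes` list are hypothetical, and the exhaustive enumeration merely stands in for what the solver does symbolically.

```python
from itertools import permutations

def valid_orders(instructions, precedes):
    """Orders in which each instruction occurs once and, for every pair
    (a, b) in `precedes`, the position of a is smaller than that of b."""
    result = []
    for order in permutations(instructions):
        # pos plays the role of the position variable of each instruction.
        pos = {ins: j for j, ins in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in precedes):
            result.append(order)
    return result

# Hypothetical block: a load that must stay before a conflicting store.
instrs = ["SLOAD_0", "SSTORE_0", "ADD_0"]
deps = [("SLOAD_0", "SSTORE_0")]
orders = valid_orders(instrs, deps)
for order in orders:
    print(order)
```

Of the six permutations of this toy block, only those that keep the load before the store survive.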


4.3 Optimization using Max-SMT 


As in [7], we formulate the problem of finding an optimal block as a partial
weighted Max-SMT problem. In this section we show that the same encoding
for gas optimization can be used in the presence of memory operations and
that other optimization criteria, like bytes-size, can be included as well in
our framework. Basically, in our Max-SMT problem, the hard constraints that
must be satisfied by every model are those constraints for computing the SMS;
and the soft constraints are used to find the optimal solution: a set of pairs
$\{[C_1, w_1], \ldots, [C_n, w_n]\}$, where $C_i$ denotes an SMT clause and $w_i$ its weight. The
Max-SMT solver minimizes the sum of the weights of the falsified soft constraints. The
weights of the soft constraints presented in [7] match the gas spent by the sequence of
instructions, thus ensuring that an optimal model corresponds to a block that spends
the least possible amount of gas. This gas encoding is also included in GASOL,
but instructions in $\mathcal{I}_{lord}$ are removed from the soft constraints. Hard constraints
already assert the exact number of times these instructions must appear in a
sequence and therefore, they only add unnecessary extra cost that may harm the
search for an optimality proof.
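The notion of a partial weighted Max-SMT problem can be sketched on a purely Boolean toy instance. The solver below is a brute-force stand-in written for illustration only (GASOL uses OptiMathSAT); the function `max_sat`, the variable names and the weights are assumptions of this example.

```python
from itertools import product

def max_sat(variables, hard, soft):
    """Return (cost, model) minimizing the total weight of falsified soft
    constraints among the assignments satisfying every hard constraint."""
    best = None
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if not all(constraint(model) for constraint in hard):
            continue
        cost = sum(w for constraint, w in soft if not constraint(model))
        if best is None or cost < best[0]:
            best = (cost, model)
    return best

# Toy instance: exactly one of two hypothetical instructions is chosen.
variables = ["use_dup", "use_callvalue"]
hard = [lambda m: m["use_dup"] != m["use_callvalue"]]
soft = [(lambda m: not m["use_dup"], 3),        # falsified if DUP1 is used
        (lambda m: not m["use_callvalue"], 2)]  # falsified if CALLVALUE is used
cost, model = max_sat(variables, hard, soft)
print(cost, model)
```

The hard constraint forces a choice, and the soft constraints make the cheaper instruction the optimal one.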


However, gas consumption is not the only relevant objective to consider when 
optimizing the code. When a contract is deployed, a fee of 200 units of gas must 
be paid for each non-zero byte of the EVM binary code. The desired trade-off 
between the initial deployment cost and invoking transactions can be specified in 
solc by setting the expected number of contract runs. In some cases, this leads to 
solc intentionally not fully replacing expressions that have a constant result by 
the value they represent if this constant is a large number, since the needed PUSH 
instructions will need many more non-zero bytes and hence will increment the 
deployment gas cost. For instance, if we want to have $2^{256} - 1$ on the top of the
stack we can either push a zero and perform the bitwise NOT operation, which
has gas cost 6 and non-zero bytes length 2, or push $2^{256} - 1$ directly, which has
gas cost 3 but non-zero bytes length 33.
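The trade-off between the two alternatives can be quantified with a small calculation. The sketch below uses only the figures given above (200 units of gas per non-zero byte, gas costs 6 and 3, byte lengths 2 and 33) and makes the simplifying assumption that the snippet is executed exactly once per contract run; the function names are ours.

```python
DEPLOY_GAS_PER_BYTE = 200  # deployment fee per non-zero byte, as stated above

def total_cost(exec_gas, nonzero_bytes, runs):
    """Deployment fee plus execution gas accumulated over `runs` executions."""
    return DEPLOY_GAS_PER_BYTE * nonzero_bytes + exec_gas * runs

def push_zero_not(runs):
    # Push a zero and apply NOT: gas 6, 2 non-zero bytes.
    return total_cost(6, 2, runs)

def push_direct(runs):
    # Push 2^256 - 1 directly: gas 3, 33 non-zero bytes.
    return total_cost(3, 33, runs)

# Smallest number of runs for which the direct push is strictly cheaper.
break_even = next(r for r in range(10**6) if push_direct(r) < push_zero_not(r))
print(break_even)
```

Under this simplified model, the shorter but slower variant remains preferable for any expected number of runs below the break-even point.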


When the bytes-size criterion is selected, we disable the application of the
simplification rules of [7] that increase the bytes-size and, in addition, propose the
following approach based on a bytes-size model for the Max-SMT encoding. This
model is fairly simple except for the handling of the PUSH-related instructions,
denoted as $\mathcal{I}_P$ in what follows. All instructions that are not in $\mathcal{I}_P$ use exactly
one byte. In contrast, PUSHx instructions take one byte to specify the opcode itself,
and x bytes to include the pushed value. Our first attempt to encode the weight
of PUSHx precisely described the size in bytes based
on the 32 options that x can take in terms of number of bytes
(recall that in EVM we have 256-bit words). This encoding is precise, but did
not work in practice. An alternative, much simpler encoding is based on the
observation that a numerical value can only appear in a model if the
corresponding PUSHx instruction is executed at least once. Later on, this value can be
repeated using DUP, which has a minimal cost wrt. size in bytes, although if the
block is large, some SWAP operation may also be needed. To make the encoding
perform well in practice, we associate a single constant weight with all PUSHx
operations that is high enough to avoid models where expensive PUSHx operations
are performed more than once instead of duplicating the value. Our experiments have
shown that a weight of 5 is enough to obtain optimal results for the sizes of blocks
that the Max-SMT solver is able to handle. Then, we can assume NOP instructions cost
0 units, instructions in $\mathcal{I}_P$ cost 5 units and the remaining instructions cost 1 unit.
Hence, three disjoint sets are introduced to match the previous costs: $W_0 := \{\mathrm{NOP}\}$,
$W_5 := \mathcal{I}_P$ and $W_1 := \mathcal{I} \setminus (W_0 \uplus W_5)$. The $\Omega'$ bytes-size model follows directly:

$$\Omega'_{SMS} := \bigcup_{0 \le j < b_o} \{[t_j = \theta(\mathrm{NOP}), 1],\ [\bigvee_{\iota \in W_0 \uplus W_1} t_j = \theta(\iota), 4]\}$$
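A quick way to see why two soft constraints per position suffice is to compute the penalty each kind of instruction incurs. The sketch below assumes our reading of the model, namely one soft constraint of weight 1 satisfied only by NOP and one of weight 4 satisfied by every instruction outside the PUSH family; the sample instruction sets and the `penalty` function are illustrative.

```python
W0 = {"NOP"}
W5 = {"PUSH"}                          # stands for all PUSHx instructions
W1 = {"POP", "DUP1", "SWAP1", "ADD"}   # sample of the remaining instructions

def penalty(instr):
    """Total weight of the soft constraints falsified by choosing `instr`."""
    soft = [
        (instr == "NOP", 1),       # satisfied only when the position holds NOP
        (instr in W0 | W1, 4),     # satisfied by every non-PUSH instruction
    ]
    return sum(weight for satisfied, weight in soft if not satisfied)

for instr in ["NOP", "ADD", "PUSH"]:
    print(instr, penalty(instr))
```

The penalties coincide with the intended costs of 0, 1 and 5 units for the three sets.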

Example 6. The optimized bytecode returned by GASOL for the gas criterion 
is PUSH24* PUSH 80 PUSH 40 MSTORE PUSH 1 SLOAD PUSH32* AND PUSH21* OR PUSH32* AND OR PUSH 1 SSTORE
CALLVALUE CALLVALUE ISZERO (using * to skip large constants), which achieves a reduction 
of 5905 units wrt. the original version and is proven optimal. For the bytes-size 
criterion, GASOL times out due to the larger size of the block when size-increasing 
simplification rules are disabled. This issue will be discussed in Sec. 5. 
5 Implementation and Experiments 


This section provides further implementation details and describes our experimental
evaluation. The GASOL tool is implemented in Python and uses as Max-SMT
solver OptiMathSAT (OMS) [13] version 1.6.3 (which is the optimality framework
of MathSAT). The aim of the experiments is to assess the effectiveness of our
proposal by comparing it with the previous tool SYRUP. A timeout is given to the
tools to specify the maximum amount of time that they can use for the analysis of
each block. The timeout given to GASOL must be larger than for SYRUP because
it works on fewer and larger blocks in order to analyze the same contract. We
have used 10 sec as timeout for SYRUP, and for GASOL we use 10*(#store+1)
sec, as this corresponds to the sum of the times given in SYRUP to
the partitioned blocks. It should be noted though that the cost of the search to
be performed grows exponentially with the number of additional instructions.
Therefore, in spite of giving a similar timeout, GASOL might time out in cases
in which it has to deal with rather large blocks, while SYRUP does not on the
corresponding smaller partitioned blocks. For this reason, we have implemented
two additional versions: gasol_all splits the blocks at all stores as SYRUP does, and
gasol_24 splits at store instructions only those blocks that have a size larger than
24 instructions. This is because we have observed during experimentation that
the SMT search does not terminate in a reasonable time from that size on. The
24-partitioning starts from the end of the block and splits it if it finds a store. If
the partitioned sub-block (from the start) still has a size larger than 24, further
partitioning is done again if a new store is found from its end, and so on. Still,
depending on where the stores are, the resulting blocks can have sizes larger than
24, as happens in SYRUP as well. Further experimentation will be needed
to come up with intelligent heuristics for the partitioning. The gasol versions
implement all techniques described in the paper, including the SMT encoding
of the dependencies between uninterpreted functions as described in Sec. 3.3. We have
the following versions of GASOL and SYRUP in the evaluation: (1) syrup_cav is
the original tool from [7], (2) gasol_all splits the blocks at all stores as in syrup_cav,
(3) gasol_24 performs the 24-partitioning described above, (4) gasol_none does not
perform any additional partitioning of blocks, and (5) gasol_best uses gasol_all,
gasol_24, and gasol_none as a portfolio of possible optimization results (running
them in parallel) and keeps the best result.
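The timeout rule and the 24-partitioning can be sketched as follows. This is a minimal illustration of the description above rather than GASOL's actual code: the function names are ours, blocks are plain lists of opcode names, and we assume that a store at which a block is split stays at the end of the preceding sub-block.

```python
STORES = {"SSTORE", "MSTORE", "MSTORE8"}

def gasol_timeout(block, base=10):
    """Timeout in seconds for a block: 10 * (#store + 1)."""
    return base * (sum(op in STORES for op in block) + 1)

def partition_24(block, limit=24):
    """Split a block at stores, scanning from its end, while the
    remaining prefix is longer than `limit` instructions."""
    parts = []
    rest = list(block)
    while len(rest) > limit:
        # Last store that is not the final instruction of the prefix.
        idx = next((i for i in range(len(rest) - 2, -1, -1)
                    if rest[i] in STORES), None)
        if idx is None:
            break  # no store left: the prefix stays larger than the limit
        parts.append(rest[idx + 1:])
        rest = rest[:idx + 1]
    parts.append(rest)
    parts.reverse()
    return parts

block = ["ADD"] * 9 + ["SSTORE"] + ["ADD"] * 9 + ["MSTORE"] + ["ADD"] * 10
print([len(p) for p in partition_24(block)], gasol_timeout(block))
```

In this hypothetical block of 30 instructions a single split at the last store is enough, since the remaining prefix already fits within the limit.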

We run the tools using the gas usage and the bytes-size criteria in Sec. 4.3.
As already mentioned, SYRUP in [7] did not include the bytes-size criterion,
marked as "-" in the figures. Experiments have been performed on an Intel Core
i7-7700T at 4.2GHz x 8 and 64 GB of memory, running Ubuntu 16.04.

            | G_normal | G_timeout | %G_total | T_gas  | B_normal | B_timeout | %B_total | T_bytes
syrup_cav   | 35689    | 11129     | 0.62%    | 142.93 | -        | -         | -        | -
gasol_all   | 36344    | 11975     | 0.64%    | 120.21 | 3712     | 2213      | 2.64%    | 200.17
gasol_24    | 38765    | 12336     | 0.68%    | 327.36 | 4315     | 2238      | 2.92%    | 558.48
gasol_none  | 39977    | 0         | 0.53%    | 850.75 | 3871     | 0         | 1.72%    | 1194.38
gasol_best  | 41307    | 13197     | 0.72%    | 933.66 | 4676     | 2692      | 3.28%    | 1313.36
Table 2: Overall gains in gas and bytes-size and overheads


The benchmark set. We have downloaded the last 30 verified smart contracts
from Etherscan that were compiled using version 0.8 of solc and whose source
code was available as of June 21, 2021. The reason for this selection is twofold:
(1) we require version 0.8 in order to be able to apply the latest solc optimizer and
start from a worst-case scenario in which we have the most optimized
version and, this way, assess if there is room for further optimization and, in
particular, for the two types of gains achievable by GASOL (see Sec. 1), (2)
we want to make a random choice (e.g., the last 30) rather than picking
contracts favorable to us. The benchmarks in [7] require using an old version
of the compiler (at most 0.4), hence the latest solc optimizer cannot be activated.
The source code of GASOL as well as the smart contracts analyzed are available
at https://github.com/costa-group/gasol-optimizer. We provide the results of
analyzing the compiled smart contracts generated by version 0.8.9 of solc with
the complete optimization options. The total number of blocks, given by BLK,
for the 30 contracts is 12,378. Within them, there are 1,044 SSTORE instructions,
6,631 MSTORE and 43 MSTORE8. These memory instructions are used by SYRUP to
split the basic blocks, while GASOL does not always split them, as explained above.
This results in 15,416 blocks when considering the additional 24-partitioning,
13,030 without partitioning at stores by gasol_none, and 20,467 blocks by syrup_cav
and gasol_all. As in [7], all tools split blocks at instructions like LOGX or CODECOPY.


Efficiency gains and performance overhead. Table 2 shows the overall gas and
size gains and the optimization time (in minutes). The total gas consumed by all
contracts before running the optimizers is 7,538,907, and the bytes-size is 224,540.
As is customary, we calculate such gas (resp. size) as the sum of the gas
(resp. size) consumed by all EVM instructions in the considered contracts.* For
those EVM instructions that do not consume a constant and fixed amount of gas,
such as SSTORE, EXP or SHA3, we choose the lower bound that they may consume.
Column G_normal refers to the gains for the blocks that do not time out without a
solution, G_timeout represents the gas saved by the optimized blocks that reached
the timeout in gasol_none with no result (note that G_normal is the complement
of G_timeout), and %G_total gives the total gains computed as the sum of the previous
two, given as a percentage wrt. the initial gas consumption. Columns B have the
analogous meanings for size and T gives the time in minutes. The first observation
is that our proposal of using dependencies in gasol_all pays off, as we achieve larger


* Estimating the actual gains of executing transactions on the involved contracts is a
research problem on its own which has been the subject of other work, e.g., [6,16,19,24].

            | #B    | Alr_g | Opt_g | Bet_g | Non_g | Tout_g | Alr_b | Opt_b | Bet_b | Non_b | Tout_b
syrup_cav   | 20467 | 70.54 | 27.01 | 0.47  | 0.08  | 1.9    | -     | -     | -     | -     | -
gasol_all   | 20467 | 70.63 | 27.36 | 0.64  | 0.35  | 1.02   | 83.25 | 12.83 | 1.2   | 0.69  | 2.03
gasol_24    | 15416 | 62.2  | 33.79 | 1.47  | 0.91  | 1.63   | 75.48 | 16.29 | 3.21  | 1.78  | 3.24
gasol_none  | 13030 | 65.48 | 25.3  | 3.81  | 0.34  | 5.07   | 73.44 | 11.7  | 3.1   | 2.57  | 9.19
Table 3: Optimization report (%) for SYRUP and GASOL


gains than syrup_cav in less time. The second observation is that the gains in gas
of GASOL are notably larger for blocks that do not time out (G_normal), as a larger
search space can be explored. However, those blocks that would require a larger
timeout might behave worse than in the syrup_cav and gasol_all versions working on
smaller blocks, as the original bytecode is taken as the optimization result in case
of timeout. This sometimes happens in the version gasol_24, and more often in
gasol_none. The problem is exacerbated for the bytes-size criterion because larger
blocks are considered as a result of skipping size-increasing simplification rules.
Even in B_normal the gain is smaller for gasol_none than for gasol_24. This is because
B_normal includes timeouts for which a solution is found. Our solution to mitigate
the huge computation demands required in these cases is in row gasol_best, which
runs gasol_all, gasol_24 and gasol_none in parallel and returns the best result. As
can be seen, gasol_best clearly outperforms the other systems in gas and size gains.
As regards the overhead, it is also the most expensive option, as it reaches the
timeout more often than the other systems and these timeouts accumulate in
the total time. However, as superoptimizers are often used as offline optimization tools,
which are run only prior to deployment, we argue that the gains compensate for the
additional optimization time. Finally, the interaction between the two optimization
criteria remains to be investigated, namely how the reduction in bytes-size
affects the gas consumption and vice versa.
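The percentages reported in Table 2 can be recomputed from the absolute figures, which is a useful sanity check when reading the table. The helper below is our own; it only assumes that the total gain is the sum of the normal and timeout gains divided by the initial total, as described above.

```python
TOTAL_GAS = 7_538_907   # gas of all contracts before optimization
TOTAL_BYTES = 224_540   # bytes-size of all contracts before optimization

def total_gain(normal, timeout, total):
    """Percentage gain: (normal + timeout) relative to the initial total."""
    return round(100 * (normal + timeout) / total, 2)

gas_syrup = total_gain(35689, 11129, TOTAL_GAS)   # syrup_cav, gas
gas_best = total_gain(41307, 13197, TOTAL_GAS)    # gasol_best, gas
size_best = total_gain(4676, 2692, TOTAL_BYTES)   # gasol_best, bytes-size
print(gas_syrup, gas_best, size_best)
```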


Impact of phases 1 and 2. We would also like to estimate how much is gained in
gasol_best by applying the simplification rules and how much is gained by the SMT
encoding. Regarding the simplification rules on memory, gasol_best has applied 6
rules on storage and 11 on memory: 15 of them correspond to rule i) (4 on
storage and 11 on memory) described in Def. 2, and 2 to rule ii) (both on
storage). Rule iii) is never applied on this benchmark set, but we have applied
it when optimizing other real smart contracts. As regards the percentage of the
gains, 14.6% of the gas savings come from applying the memory rules, 34.4%
from the stack rules and 51% is saved by the use of the Max-SMT solver. As
in [7], the gains due to each phase are roughly half (i.e., 50% each). Regarding the
simplification rules on stack for the gas criterion, their application has increased
by 11.4% in gasol_best because it works on larger blocks and has more opportunities
to apply them. However, when selecting the bytes-size criterion, there are fewer
simplification rules applied (namely 96% fewer), as rules that generate larger
code in terms of size are not applied (see Sec. 4.3).


Optimality results. Table 3 provides additional detailed information, which is
also part of the optimization report of Fig. 1. Column #B shows the total
number of blocks analyzed in each case, depending on the partitioning. In the
remaining columns, we show the percentages of: Alr, blocks that are
already optimal, i.e., those blocks that cannot be optimized because they already
consume the minimal amount of gas; Opt, blocks that have been optimized and
for which the SMT solver has proved the optimality of the solution, i.e., they consume
the minimum amount of gas needed to generate the provided SMS; Bet, blocks
that have been optimized and therefore consume less gas than the original ones,
but whose solution is not proved to be optimal; Non, blocks that have not been
optimized and for which the solver has not been able to prove whether they are optimal,
i.e., the solution found is the original one but a better one may exist; Tout, blocks
where the solver reached the timeout without finding a model. The columns with
subscript b are the analogues for the bytes-size criterion. We can observe in the table that
gasol_none times out in more cases due to the larger sizes of the blocks that it
optimizes, but the percentages of blocks for which it finds a better and optimal
solution are notably high. It should also be noted that the results of SYRUP (and
gasol_all) and, to a lesser extent, of gasol_24 wrt. optimality are weaker. This is
because they work on strictly smaller blocks and hence they can prove optimality
for the partitioned blocks, but when glued together, the optimality may be lost.
This is also the reason why the results for gasol_best are not included: it
mixes different notions of optimality and the concepts are not well-defined. Due
to this weaker optimality, the Opt and Bet results are only slightly better for
GASOL than for SYRUP. However, the truly important aspect is that the actual
gas and size gains for GASOL in Table 2 are notably larger.


6 Conclusions and Future Work 


We have presented GASOL”?, a Max-SMT based superoptimizer for Ethereum 
smart contracts that uses the assembly json exchange format of the solc compiler 
for a direct integration into it. GASOL”? extends the Max-SMT approach of 
SYRUP [7] with memory and storage operations, which constitute the most chal- 
lenging and relevant features left out in SYRUP’s approach. GASOL”? is part of 
the GASOL project [3] that aims at developing a GAS Optimization tooLkit that 
will integrate inter-block optimizations [5] as well. Namely, the initial optimizer [5] 
of the GASOL project uses inter-block analysis to detect storage accesses that can 
be replaced by cheaper memory accesses, thus making global optimizations that 
are orthogonal and complementary to our intra-block ones. As part of our future
work, we plan to investigate potential synergies among the different optimization
proposals for smart contracts. This also includes the cooperation with the
solc optimizer [4], which incorporates classical compiler optimizations (e.g., dead
code elimination, constant propagation, etc.) from which our superoptimizer
is already benefiting (since we are applying the solc optimizer). In the other
order of application, we also expect gains when applying classical analyses after
superoptimization. For instance, we have observed that after applying
simplification rule (i) in Def. 2 and eliminating load instructions, we might leave
store operations on memory locations that will never be accessed again, and that
could be eliminated afterwards by applying an inter-block analysis ensuring that
there are no further accesses to such memory locations. The combination of the
techniques and tools thus seems a promising direction for future research.

Abstract. We present L#, a new and simple approach to active automata
learning. Instead of focusing on equivalence of observations, like the L*
algorithm and its descendants, L# takes a different perspective: it tries to
establish apartness, a constructive form of inequality. L# does not require
auxiliary notions such as observation tables or discrimination trees, but
operates directly on tree-shaped automata. L# has the same asymptotic
query and symbol complexities as the best existing learning algorithms,
but we show that adaptive distinguishing sequences can be naturally
integrated to boost the performance of L# in practice. Experiments
with a prototype implementation, written in Rust, suggest that L# is
competitive with existing algorithms.


Keywords: L# algorithm - active automata learning - Mealy machine -

1 Introduction


In 1987, Dana Angluin published a seminal paper [5], in which she showed that 
the class of regular languages can be learned efficiently using queries. In Angluin’s 
approach of a minimally adequate teacher (MAT), learning is viewed as a game 
in which a learner has to infer a deterministic finite automaton (DFA) for an 
unknown regular language L by asking queries to a teacher. The learner may 
pose two types of queries: “Is the word w in L?” (membership queries), and “Is 
the language recognized by DFA H equal to L?” (equivalence queries). In case of 
a no answer to an equivalence query, the teacher supplies a counterexample that 
distinguishes hypothesis H from L. The L* algorithm proposed by Angluin [5] is 
able to learn L by asking a polynomial number of membership and equivalence 
queries (polynomial in the size of the corresponding canonical DFA). 

Angluin’s approach triggered a lot of subsequent research on active automata
learning and has numerous applications in the area of software and hardware
analysis, for instance for generating conformance test suites of software
components [28], finding bugs in implementations of security-critical protocols [22,23,21],
learning interfaces of classes in software libraries [33], inferring interface protocols
of legacy software components [8], and checking that a legacy component and a
refactored implementation have the same behavior [55]. We refer to [63,34] for
surveys and further references.

Since 1987, major improvements of the original L* algorithm have been
proposed, for instance by [52,53,38,41,56,35,45,50,32,37,25]. Yet, all these
improvements are variations of L* in the sense that they approximate the Nerode
congruence by means of refinement. Isberner [36] shows that these descendants
of L* can be described in a single, general framework.¹

Variations of L* have also been used as a basis for learning extensions of DFAs 
such as Mealy machines [48], I/O automata [2], non-deterministic automata [16], 
alternating automata [6], register automata [1,17], nominal automata [46], sym- 
bolic automata [40,7], weighted automata [14,11,30], Mealy machines with timers 
[64], visibly pushdown automata [36], and categorical generalisations of au- 
tomata [62,29,12,18]. It is fair to say that L*-like algorithms completely dominate 
the research area of active automata learning. 

In this paper we present L#, a fresh approach to automata learning that differs
from L* and its descendants. Instead of focusing on equivalence of observations,
L# tries to establish apartness, a constructive form of inequality [61,26]. The
notion of apartness is standard in constructive real analysis and goes back to
Brouwer, with Heyting giving an axiomatic treatment in [31]. This change in
perspective has several key consequences, developed and presented in this paper:


– L# does not maintain auxiliary data structures such as observation tables or
discrimination trees, but operates directly on the observation tree. This tree
is a partial Mealy machine itself, and is very close to an actual hypothesis
that can be submitted to the teacher. As a result, our algorithm is simple.

– The asymptotic query complexity of L# is $O(kn^2 + n \log m)$ and the asymptotic
symbol complexity² is $O(kmn^2 + nm \log m)$. Here $k$ is the number of
input symbols, $n$ is the number of states, and $m$ is the length of the longest
counterexample. These are the same asymptotic complexities as the best
existing (L*-like) learning algorithms [52,53,32,37,36,25].

– The use of observation trees as primary data structure makes it easy to
integrate concepts from conformance testing to improve the performance
of L#. In particular, adaptive distinguishing sequences [39], which we can
compute directly from the observation tree, turn out to be an effective boost
in practice, even if their use does not affect asymptotic complexities. Through
L#, testing and learning become even more intertwined [13,4].


– Experiments on benchmarks of [47], with a prototype implementation written
in Rust, suggest that L# is competitive with existing, highly optimized
algorithms implemented in LearnLib [51].


¹ Except for the ZQ algorithm of [50], which was developed independently, and the
ADT algorithm of [25], which was developed later and uses adaptive distinguishing
sequences which are not covered in Isberner’s framework.

² The symbol complexity is the number of input symbols required to learn an automaton.
This is a relevant measure for practical learning scenarios, where the total time needed
to learn a model is proportional to the number of input symbols.


Related work. Despite the different data structures, L# and L* [5] still have
many similarities, since both store all the information gained from all queries so
far. Moreover, both maintain a set of those states that have been learned with 
absolute certainty already. A few other algorithms have been proposed that follow 
a different approach than L*. Meinke [43,44] developed a dual approach where, 
instead of starting with a maximally coarse approximating relation and refining 
it during learning, one starts with a maximally fine relation and coarsens it by 
merging equivalence classes. Although Meinke reports superior performance in 
the application to learning-based testing, these algorithms have exponential worst- 
case query complexities. Using ideas from [53], Groz et al. [27] use a combination 
of homing sequences and characterization sets to develop an algorithm for active 
model learning that does not require the ability to reset the system. Via an 
extensive experimental evaluation involving benchmarks from [47] they show 
that the performance of their algorithm is competitive with the L* descendant of 
[56], but there can be huge differences in the performance of their algorithm for 
models that are similar in size and structure. Several authors have explored the 
use of SAT and SMT solvers for obtaining learning algorithms, see for instance 
[49,58], but these approaches suffer from fundamental scalability problems. In 
a recent paper, Soucha & Bogdanov [60] outline an active learning algorithm 
which also takes the observation tree as the primary data structure, and use 
results from conformance testing to speed up learning. They report that an 
implementation of their approach outperforms standard learning algorithms 
like L*, but they have no explicit apartness relation and associated theoretical 
framework. It is precisely this theoretical underpinning which allowed us to 
establish complexity and correctness results, and define efficient procedures for 
counterexample processing and computing adaptive distinguishing sequences. 

In the present paper, we first define partial Mealy machines, observation trees, 
and apartness (Section 2). Then, we present the full L# algorithm (Section 3) 
and benchmark our prototype implementation (Section 4). The proofs of all 
theorems and complete benchmark results can be found in the appendix of the 
full version [65] of this paper. 


2 Partial Mealy Machines and Apartness 


The L# algorithm learns a hidden (complete) Mealy machine, and its primary
data structure is a partial Mealy machine. We first fix notation for partial maps.

We write $f\colon X \rightharpoonup Y$ to denote that $f$ is a partial function from $X$ to $Y$
and write $f(x){\downarrow}$ to mean that $f$ is defined on $x$, that is, $\exists y \in Y\colon f(x) = y$,
and conversely write $f(x){\uparrow}$ if $f$ is undefined for $x$. Often, we identify a partial
function $f\colon X \rightharpoonup Y$ with the set $\{(x,y) \in X \times Y \mid f(x) = y\}$. The composition of
partial maps $f\colon X \rightharpoonup Y$ and $g\colon Y \rightharpoonup Z$ is denoted by $g \circ f\colon X \rightharpoonup Z$, and we have
$(g \circ f)(x){\downarrow}$ iff $f(x){\downarrow}$ and $g(f(x)){\downarrow}$. There is a partial order on $X \rightharpoonup Y$ defined by
$f \sqsubseteq g$ for $f, g\colon X \rightharpoonup Y$ if for all $x \in X$, $f(x){\downarrow}$ implies $g(x){\downarrow}$ and $f(x) = g(x)$.

226 F. Vaandrager et al.

Throughout this paper, we fix a finite set $I$ of inputs and a set $O$ of outputs.


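Definition 2.1. A Mealy machine is a tuple $\mathcal{M} = (Q, q_0, \delta, \lambda)$, where

– $Q$ is a finite set of states and $q_0 \in Q$ is the initial state,

– $\langle\lambda,\delta\rangle\colon Q \times I \rightharpoonup O \times Q$ is a partial map whose components are an output
function $\lambda\colon Q \times I \rightharpoonup O$ and a transition function $\delta\colon Q \times I \rightharpoonup Q$ (hence,
$\delta(q,i){\downarrow} \Leftrightarrow \lambda(q,i){\downarrow}$, for $q \in Q$ and $i \in I$).

We use superscript $\mathcal{M}$ to disambiguate to which Mealy machine we refer, e.g.
$Q^{\mathcal{M}}$, $q_0^{\mathcal{M}}$, $\delta^{\mathcal{M}}$ and $\lambda^{\mathcal{M}}$. We write $q \xrightarrow{i/o} q'$, for $q, q' \in Q$, $i \in I$, $o \in O$ to denote
$\lambda(q,i) = o$ and $\delta(q,i) = q'$. We call $\mathcal{M}$ complete if $\delta$ is total, i.e., $\delta(q,i)$ is
defined for all states $q$ and inputs $i$. We generalize the transition and output
functions to input words of length $n \in \mathbb{N}$ by composing $\langle\lambda,\delta\rangle$ $n$ times with itself:
we define maps $\langle\lambda_n,\delta_n\rangle\colon Q \times I^n \rightharpoonup O^n \times Q$ by $\langle\lambda_0,\delta_0\rangle = \mathrm{id}_Q$ and

$$\langle\lambda_{n+1},\delta_{n+1}\rangle\colon Q \times I^{n+1} \xrightarrow{\langle\lambda_n,\delta_n\rangle \times \mathrm{id}_I} O^n \times Q \times I \xrightarrow{\mathrm{id}_{O^n} \times \langle\lambda,\delta\rangle} O^{n+1} \times Q$$

Whenever it is clear from the context, we use $\lambda$ and $\delta$ also for words.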
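As an informal illustration of Definition 2.1 and of the extension to words, a partial Mealy machine can be represented by a dictionary from (state, input) pairs to (output, state) pairs. The class below is a sketch with names of our choosing (`Mealy`, `run`, `is_complete`), and the two-state machine is a hypothetical example, not one taken from this paper.

```python
class Mealy:
    """Partial Mealy machine: `trans` maps (state, input) to (output, state)."""

    def __init__(self, initial, trans):
        self.initial = initial
        self.trans = dict(trans)

    def run(self, state, word):
        """Extend output and transition functions to words.

        Returns (outputs, final_state), or None if some transition
        along the word is undefined."""
        outputs = []
        for i in word:
            if (state, i) not in self.trans:
                return None
            o, state = self.trans[(state, i)]
            outputs.append(o)
        return tuple(outputs), state

    def is_complete(self, inputs):
        """True if every state has a transition for every input."""
        states = {self.initial} | {q for q, _ in self.trans}
        states |= {q for _, q in self.trans.values()}
        return all((q, i) in self.trans for q in states for i in inputs)

# Hypothetical two-state machine over inputs {a, b} and outputs {0, 1}.
m = Mealy("q0", {("q0", "a"): (0, "q1"),
                 ("q1", "a"): (1, "q0"),
                 ("q0", "b"): (0, "q0")})
print(m.run("q0", "aa"), m.run("q1", "b"), m.is_complete("ab"))
```

The empty word corresponds to the base case of the inductive definition: it produces no output and leaves the state unchanged.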


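Definition 2.2. The semantics of a state $q$ is a map $\llbracket q \rrbracket\colon I^* \rightharpoonup O^*$ defined by
$\llbracket q \rrbracket(\sigma) = \lambda(q,\sigma)$. States $q, q'$ in possibly different Mealy machines are equivalent,
written $q \approx q'$, if $\llbracket q \rrbracket = \llbracket q' \rrbracket$. Mealy machines $\mathcal{M}$ and $\mathcal{N}$ are equivalent if their
respective initial states are equivalent: $q_0^{\mathcal{M}} \approx q_0^{\mathcal{N}}$.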


In our learning setting, an undefined value in the partial transition map 
represents lack of knowledge. We consider maps between Mealy machines that 
preserve existing transitions, but possibly extend the knowledge of transitions: 


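Definition 2.3. For Mealy machines $\mathcal{M}$ and $\mathcal{N}$, a functional simulation
$f\colon \mathcal{M} \to \mathcal{N}$ is a map $f\colon Q^{\mathcal{M}} \to Q^{\mathcal{N}}$ with

$$f(q_0^{\mathcal{M}}) = q_0^{\mathcal{N}} \quad \text{and} \quad q \xrightarrow{i/o} q' \text{ implies } f(q) \xrightarrow{i/o} f(q').$$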


Intuitively, a functional simulation preserves transitions. In the literature, a 
functional simulation is also called refinement mapping [3]. 


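Lemma 2.4. For a functional simulation $f\colon \mathcal{M} \to \mathcal{N}$ and $q \in Q^{\mathcal{M}}$, we have

$$\llbracket q \rrbracket \sqsubseteq \llbracket f(q) \rrbracket.$$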


For a given machine M, an observation tree is simply a Mealy machine itself 
which represents the inputs and outputs we have observed so far during learning. 
Using functional simulations, we define it formally as follows. 


Definition 2.5 ((Observation) Tree). A Mealy machine $\mathcal{T}$ is a tree if for
each $q \in Q^{\mathcal{T}}$ there is a unique sequence $\sigma \in I^*$ s.t. $\delta^{\mathcal{T}}(q_0^{\mathcal{T}}, \sigma) = q$. We write
access(q) for the sequence of inputs leading to $q$. A tree $\mathcal{T}$ is an observation
tree for a Mealy machine $\mathcal{M}$ if there is a functional simulation $f\colon \mathcal{T} \to \mathcal{M}$.

Fig. 1: An observation tree (left) for a Mealy machine (right).


Figure 1 shows an observation tree for the Mealy machine displayed on the
right. The functional simulation $f$ is indicated via coloring of the states.

By performing output and equivalence queries, the learner can build an 
observation tree for the unknown Mealy machine M of the teacher. However, 
the learner does not know the functional simulation. Nevertheless, by analysis of 
the observation tree, the learner may infer that certain states in the tree cannot 
have the same color, that is, they cannot be mapped to same states of M by a 
functional simulation. In this analysis, the concept of apartness, a constructive 
form of inequality, plays a crucial role [61,26]. A similar concept has previously 
been studied in the context of automata learning under the name inequivalence 
constraints in work on passive learning of DFAs, see for instance [15,24]. 


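Definition 2.6. For a Mealy machine M, we say that states $q, p \in Q^M$ are
apart (written $q \# p$) if there is some $\sigma \in I^*$ such that $[\![q]\!](\sigma){\downarrow}$, $[\![p]\!](\sigma){\downarrow}$, and
$[\![q]\!](\sigma) \neq [\![p]\!](\sigma)$. We say that $\sigma$ is the witness of $q \# p$ and write $\sigma \vdash q \# p$.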
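On a tree, apartness can be decided by walking both subtrees in lockstep. The following is a minimal sketch under assumed names (`Tree`, `witness` and the example `TREE` are hypothetical, not the authors' implementation); it returns some witness, not necessarily a shortest one.

```python
from typing import Dict, Optional, Tuple

# state -> input -> (output, successor); a tree-shaped partial Mealy machine.
Tree = Dict[str, Dict[str, Tuple[str, str]]]


def witness(tree: Tree, q: str, p: str) -> Optional[Tuple[str, ...]]:
    """Return an input word on which q and p are both defined but produce
    different outputs, or None if q and p are not apart."""
    for i, (o_q, q_next) in tree[q].items():
        if i not in tree[p]:
            continue  # the word must be defined from both states
        o_p, p_next = tree[p][i]
        if o_q != o_p:
            return (i,)  # the outputs differ on this very symbol
        rest = witness(tree, q_next, p_next)
        if rest is not None:
            return (i,) + rest
    return None


# Hypothetical observation tree: t0 -a/0-> t1 -a/1-> t2 and t0 -b/0-> t3 -a/0-> t4.
TREE: Tree = {
    "t0": {"a": ("0", "t1"), "b": ("0", "t3")},
    "t1": {"a": ("1", "t2")},
    "t2": {},
    "t3": {"a": ("0", "t4")},
    "t4": {},
}
```

In this example `t0` and `t1` are apart with witness `a`, whereas `t0` and `t3` are not (yet) apart: the only shared input is `a`, the outputs agree, and the successors share no further input.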


Note that the apartness relation $\# \subseteq Q \times Q$ is irreflexive and symmetric. A
witness is also called separating sequence [59]. For the observation tree of Figure 1 
we may derive the following apartness pairs and corresponding witnesses: 


$$a \vdash t_0 \# t_3 \qquad a \vdash t_2 \# t_3 \qquad b\,a \vdash t_0 \# t_2$$


The apartness of states q # p expresses that there is a conflict in their semantics, 
and consequently, apart states can never be identified by a functional simulation: 


Lemma 2.7. For a functional simulation $f \colon T \to M$,

$$q \# p \ \text{in}\ T \implies f(q) \not\approx f(p) \ \text{in}\ M \qquad \text{for all } q, p \in Q^T.$$


Thus, whenever states are apart in the observation tree T, the learner knows
that these are distinct states in the hidden Mealy machine M.

The apartness relation satisfies a weaker version of co-transitivity, stating
that if $\sigma \vdash r \# r'$ and q has the transitions for $\sigma$, then q must be apart from at
least one of r and r', or maybe even both:


Lemma 2.8 (Weak co-transitivity). In every Mealy machine M,

$$\sigma \vdash r \# r' \ \wedge\ \delta(q,\sigma){\downarrow} \implies r \# q \ \vee\ r' \# q \qquad \text{for all } r, r', q \in Q^M,\ \sigma \in I^*.$$

We use the weak co-transitivity property during learning. For instance in Fig. 1, by
posing the output query aba, consisting of the access sequence for $t_1$ concatenated
with the witness ba for $t_0 \# t_2$, co-transitivity ensures that $t_0 \# t_1$ or $t_2 \# t_1$.
By inspecting the outputs, the learner may conclude that $t_0 \# t_1$.


3 Learning Algorithm 


The task solved by L# is to find a strategy for the learner in the following game:


Definition 3.1. In the learning game between a learner and a teacher, the 
teacher has a complete Mealy machine M and answers the following queries from 
the learner: 


OUTPUTQUERY($\sigma$): For $\sigma \in I^*$, the teacher replies with the corresponding
output sequence $\lambda^M(q_0^M, \sigma) \in O^*$.³

EQUIVQUERY(H): For a complete Mealy machine H, the teacher replies yes
if $H \approx M$ or no, providing some $\sigma \in I^*$ with $\lambda^M(q_0^M, \sigma) \neq \lambda^H(q_0^H, \sigma)$.


Our L# algorithm operates on an observation tree $T = (Q, q_0, \delta, \lambda)$ for the
unknown complete Mealy machine M, where T contains the results of all output
and equivalence queries so far. An observation tree is similar to the cache which
is commonly used in implementations of L*-based learning algorithms to store
the answers to previously asked queries, avoiding duplicates [10,42]. But whereas
for L*-based learning algorithms the cache is an auxiliary data structure and
only used for efficiency reasons, it is a first-class citizen in L#.


Remark 3.2. The learner has no information about the teacher’s hidden Mealy 
machine. In particular, whenever we write #, we always refer to the apartness 
relation on the observation tree T. 


The observation tree is structured in a very similar way as Dijkstra’s shortest 
path algorithm [19] structures a graph. Recall that during the execution of 
Dijkstra’s algorithm ‘the nodes are subdivided into three sets’ [19]: 


1. the nodes S to which a shortest path from the initial node is known. S
initially only contains the initial node and grows from there. 

2. the nodes F from which the next node to be added to S will be selected. 

3. the remaining nodes. 


This scheme adapts to the observation tree as follows and is visualized in Fig. 2a. 


1. The states $S \subseteq Q^T$, which already have been fully identified, i.e. the learner
found out that these must represent distinct states in the teacher’s hidden
Mealy machine. We call S the basis. Initially, $S := \{q_0^T\}$, and throughout
the execution S forms a subtree of T and all states in S are pairwise apart:

$$\forall p, q \in S,\ p \neq q \colon\ p \# q.$$


3 In fact, later on we will assume that the teacher responds to slightly more general 
output queries to enable the use of adaptive distinguishing sequences, see Section 3.5. 
Fig. 2: From the observation tree to the hypothesis (|I| = 2). Panels: (b) a choice $h \colon F \to S$; (c) hypothesis H for h.


2. the frontier $F \subseteq Q^T$, from which the next node to be added to S is chosen.
Throughout the execution, F is the set of immediate non-basis successors of
basis states: $F := \{q' \in Q \setminus S \mid \exists q \in S, i \in I \colon q' = \delta(q,i)\}$.

3. the remaining states $Q \setminus (S \cup F)$.


Initially, T consists of only an initial state $q_0^T$ with no transitions. For every
OUTPUTQUERY($\sigma$) during the execution, the input $\sigma \in I^*$ and the corresponding
response of type $O^*$ is added automatically to the observation tree T, and
similarly every negative response to an EQUIVQUERY leads to new states and
transitions in the observation tree. With every extension T' of the observation
tree T, the apartness relation can only grow: whenever $p \# q$ in T, then still
$p \# q$ in T'. Thus, along the learning game, T and # grow steadily:


Assumption 3.3 We implicitly require that via output and equivalence queries, 
the observation tree T and the basis S are gradually extended, with the frontier 
F automatically moving along while S grows. 
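The bookkeeping described above can be sketched as follows. The class and function names (`ObservationTree`, `insert`, `frontier`) and the integer state numbering are assumptions made for this illustration only; the paper does not prescribe a data structure.

```python
class ObservationTree:
    """Tree-shaped partial Mealy machine; state 0 is the initial state."""

    def __init__(self):
        self.succ = {0: {}}      # state -> input -> (output, successor)
        self.access = {0: ()}    # state -> access sequence

    def insert(self, inputs, outputs):
        """Record the answer to an output query and return the state reached."""
        q = 0
        for i, o in zip(inputs, outputs):
            if i in self.succ[q]:
                old_o, nxt = self.succ[q][i]
                if old_o != o:
                    raise ValueError("observation contradicts an earlier one")
            else:
                nxt = len(self.succ)             # fresh state name
                self.succ[nxt] = {}
                self.access[nxt] = self.access[q] + (i,)
                self.succ[q][i] = (o, nxt)
            q = nxt
        return q


def frontier(tree, basis):
    """Immediate successors of basis states that are not in the basis."""
    return {nxt
            for q in basis
            for (_, nxt) in tree.succ[q].values()
            if nxt not in basis}
```

Because `frontier` is recomputed from the basis, it moves along automatically whenever a state is added to the basis or a query extends the tree.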


3.1 Hypothesis construction 


At almost any point during the learning game, the learner can come up with a
hypothesis H based on the knowledge in the observation tree T. Since the basis
S contains the states already discovered, the set of states of such a hypothesis is
simply set to $Q^H := S$, and it contains every transition between basis states (in
T). The hypothesis must also reflect the transitions in T that leave the basis S,
i.e. the transitions to the frontier. Those are resolved by finding for every frontier
state a basis state, for which the learner conjectures that they are equivalent
states in the hidden Mealy machine. This choice boils down to a map $h \colon F \to S$
(shown as arrows in Fig. 2b). Then, a transition $q \xrightarrow{i/o} p$ in T with $q \in S$, $p \in F$ leads to a
transition $q \xrightarrow{i/o} h(p)$ in H (Fig. 2c). These ideas are formally defined as follows.

Definition 3.4. Let T be an observation tree with basis S and frontier F.

1. A Mealy machine H contains the basis if $Q^H = S$ and $\delta^H(q_0^H, \mathrm{access}(q)) = q$
for all $q \in S$.

2. A hypothesis is a complete Mealy machine H containing the basis such that
$q \xrightarrow{i/o'} p'$ in H ($q \in S$) and $q \xrightarrow{i/o} p$ in T imply $o = o'$ and $\neg(p \# p')$ (in T).

3. A hypothesis H is consistent if there is a functional simulation $f \colon T \to H$.

4. For a Mealy machine H containing the basis, an input sequence $\sigma \in I^*$ is
said to lead to a conflict if $\delta^T(q_0^T, \sigma) \# \delta^H(q_0^H, \sigma)$ (in T).


Intuitively, the first three notions describe how confident we are in the correctness
of the ‘back loops’ in H obtained from a choice $h \colon F \to S$. Notion 1 does not
provide any warranty, notion 2 asserts that $\neg(q \# h(q))$ for all $q \in F$, and
notion 3 (by definition) means that T is an observation tree for H, that is, all
observations so far are consistent with the hypothesis H. The learner can verify
the consistency of a hypothesis without querying the teacher (the algorithm is in
Section 3.3 below). The existence and uniqueness of a hypothesis are related to
criteria on T:


Definition 3.5. In an observation tree T, a state in F is 1. isolated if it is
apart from all states in S, and 2. identified if it is apart from all states in S
except one. 3. The basis S is complete if each state in S has a transition for
each input in I.


Lemma 3.6. For an observation tree T, if F has no isolated states then there 
exists a hypothesis H for T. If S is complete and all states in F are identified 
then the hypothesis is unique. 
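The construction behind Lemma 3.6 can be sketched as below. All names (`apart`, `candidates`, `build_hypothesis`, `TREE`) are hypothetical, and the choice of the first candidate is an arbitrary tie-break made for this sketch; the paper only requires that some non-apart basis state is chosen.

```python
def apart(tree, q, p):
    """True if some input word is defined from both states with different outputs."""
    for i, (o, q_next) in tree[q].items():
        if i in tree[p]:
            o2, p_next = tree[p][i]
            if o != o2 or apart(tree, q_next, p_next):
                return True
    return False


def candidates(tree, basis, q):
    """Basis states that q is not apart from (empty list: q is isolated)."""
    return [s for s in basis if not apart(tree, q, s)]


def build_hypothesis(tree, basis, inputs):
    """Return {(state, input): (output, target)} over the basis, or None when
    the basis is incomplete or some frontier state is isolated."""
    hyp = {}
    for s in basis:
        for i in inputs:
            if i not in tree[s]:
                return None          # basis not complete
            o, nxt = tree[s][i]
            if nxt not in basis:     # frontier state: redirect into the basis
                cands = candidates(tree, basis, nxt)
                if not cands:
                    return None      # isolated frontier state
                nxt = cands[0]       # arbitrary choice h(nxt)
            hyp[(s, i)] = (o, nxt)
    return hyp


# Hypothetical tree over the single input "a": t0 -a/0-> t1 -a/1-> t2 -a/0-> t3.
TREE = {
    "t0": {"a": ("0", "t1")},
    "t1": {"a": ("1", "t2")},
    "t2": {"a": ("0", "t3")},
    "t3": {},
}
```

With basis `["t0", "t1"]`, the frontier state `t2` is apart from `t1` but not from `t0`, so it is identified and the resulting two-state hypothesis is unique.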


With a growing observation tree T, the hidden Mealy machine is found as
soon as the basis is big enough:


Theorem 3.7. Suppose T is an observation tree for a (hidden) Mealy machine
M such that S is complete, all states in F are identified, and |S| is the number
of equivalence classes of $\approx^M$. Then $H \approx M$ for the unique hypothesis H.


The theorem itself is not necessary for the correctness of L#, but guarantees
feasibility of learning.


3.2 Main loop of the algorithm 


The L# algorithm is listed in Algorithm 1 in pseudocode. The code uses Dijkstra’s
guarded command notation [20], which means that the following rules are applied
non-deterministically until none of them can be applied anymore:


(R1) If F contains an isolated state, then this means that we have discovered a
new state not yet present in S, hence we move it from F to S.

(R2) When a state $q \in S$ has no outgoing i-transition, for some $i \in I$, the output
query for access(q) i will add the generated i-successor, implicitly extending
the frontier F.
Algorithm 1 Overall L# algorithm

procedure LSHARP
    do q isolated, for some $q \in F$ →                                            ▷ Rule (R1)
            S ← S ∪ {q}
    [] $\delta^T(q, i){\uparrow}$, for some $q \in S$, $i \in I$ →                  ▷ Rule (R2)
            OUTPUTQUERY(access(q) i)
    [] $\neg(q \# r)$, $\neg(q \# r')$, for some $q \in F$, $r, r' \in S$, $r \neq r'$ →    ▷ Rule (R3)
            $\sigma$ ← witness of $r \# r'$
            OUTPUTQUERY(access(q) $\sigma$)
    [] F has no isolated states and basis S is complete →                          ▷ Rule (R4)
            H ← BUILDHYPOTHESIS
            (b, $\sigma$) ← CHECKCONSISTENCY(H)
            if b = yes then
                (b, $\rho$) ← EQUIVQUERY(H)
                if b = yes then: return H
                else: $\sigma$ ← shortest prefix of $\rho$ such that $\delta^H(q_0^H, \sigma) \# \delta^T(q_0^T, \sigma)$ (in T)
            end if
            PROCCOUNTEREX(H, $\sigma$)
    end do
end procedure


(R3) When $q \in F$ is a state in the frontier that is not yet identified, then there are
at least two states in S that are not apart from q. In this case, the algorithm
picks a witness $\sigma \in I^*$ for $r \# r'$. After the OUTPUTQUERY(access(q) $\sigma$), the
observation tree is extended and thus q will be apart from at least r or r' by
weak co-transitivity (Lemma 2.8).

(R4) When F has no isolated states and S is complete, BUILDHYPOTHESIS
picks a hypothesis H (at least one exists by Lemma 3.6). If H is not consistent
with observation tree T we get a conflict $\sigma$ for free. Otherwise, we pose an
equivalence query for H. If the hypothesis is correct, L# terminates, and
otherwise we obtain a counterexample $\rho$. The counterexample decomposes
into two words $\sigma\eta$, where $\sigma$ leads to a conflict and $\eta$ witnesses it. The conflict
$\sigma$ means that one of the frontier states was merged with an apart basis state
in H, causing a wrong transition in H. Since $\sigma$ can be very long, the task
of PROCCOUNTEREX($\sigma$) is to shorten $\sigma$ until we know which frontier state
caused the conflict. So after PROCCOUNTEREX, H is not a hypothesis for
the updated T anymore.


We will show the correctness of L# in a top-down approach discussing the 
subroutines later and only assuming now that: 


1. BUILDHYPOTHESIS picks one of the possible hypotheses (Lemma 3.6).

2. CHECKCONSISTENCY(H) tells if there is a functional simulation $T \to H$, and
if not, provides $\sigma \in I^*$ leading to a conflict (Lemma 3.10 below).

3. If H contains the basis and $\sigma$ leads to a conflict, then PROCCOUNTEREX(H, $\sigma$)
extends T such that H is not a hypothesis anymore (Lemma 3.11 below).

Whenever the algorithm terminates, the learner has found the correct model.
Therefore, correctness amounts to showing termination. The rough idea is that
each rule will let S, F, or # restricted to $S \times F$ grow, and each of these sets is
bounded by the hidden Mealy machine M. We define the norm N(T) by


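$$N(T) := \frac{|S| \cdot (|S|+1)}{2} + \bigl|\{(q,i) \in S \times I \mid \delta^T(q,i){\downarrow}\}\bigr| + \bigl|\{(q,q') \in S \times F \mid q \# q'\}\bigr| \qquad (1)$$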


The first summand increases whenever a state is moved from F to S (R1); it is 
quadratic in |S| because (R1) reduces the third summand. The second summand 
records the progress achieved by extending the frontier (R2). The third summand 
counts how much the states in the frontier are identified (R3). Rule (R4) extends 
the apartness relation, leading to an increase of the third summand. 


Theorem 3.8. Every rule application in L# increases the norm N(T) in (1). 


The norm N(T) and therefore also the number of rule applications is bounded:


Theorem 3.9. If T is an observation tree for M with n equivalence classes of
states and |I| = k, then $N(T) \le \frac{1}{2} \cdot n \cdot (n+1) + kn + (n-1)(kn+1) \in O(kn^2)$.


At any point of execution, either rule (R1), (R2), or (R4) is applicable, so L#
never blocks. As soon as the norm N(T) hits the bound, the only applicable rule
is rule (R4) with the teacher accepting the hypothesis. Thus, the correct Mealy
machine is learned within $O(k \cdot n^2)$ rule applications. The complexity in terms of
the input parameters is studied in Section 3.6.
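To get a feeling for the size of this bound, the following sketch evaluates the norm (1) and the bound of Theorem 3.9. The function names `norm` and `norm_bound` are assumptions made for this illustration; the sample values n = 66 and k = 13 are the size of the largest benchmark model mentioned in the experimental section.

```python
def norm(basis_size: int, defined_basis_transitions: int, apart_pairs: int) -> int:
    """N(T): |S|(|S|+1)/2, plus the number of defined transitions leaving basis
    states, plus the number of apart (basis, frontier) pairs."""
    return basis_size * (basis_size + 1) // 2 + defined_basis_transitions + apart_pairs


def norm_bound(n: int, k: int) -> int:
    """Upper bound n(n+1)/2 + kn + (n-1)(kn+1) on N(T) for n states, k inputs."""
    return n * (n + 1) // 2 + k * n + (n - 1) * (k * n + 1)


# Largest benchmark model in the evaluation: 66 states, 13 input symbols.
LARGEST_BENCHMARK_BOUND = norm_bound(66, 13)  # 2211 + 858 + 55835
```

Since every rule application increases the norm, `norm_bound(n, k)` also bounds the number of rule applications for a machine with n states and k inputs.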

We now continue defining the subroutines and proving them correct. 


3.3 Consistency checking 


A hypothesis H is not necessarily consistent with T, in the sense of a functional
simulation $T \to H$. Via a breadth-first search of the Cartesian product of T
and H (Algorithm 2), we may check in time linear in the size of T whether a
functional simulation $T \to H$ exists. In the negative case, we obtain $\sigma \in I^*$
leading to a conflict without any equivalence or output query to the teacher
needed. Thus, this is also called ‘counterexample milking’ [10].


Lemma 3.10. Algorithm 2 terminates and is correct, that is, if H is a hypothesis
for T with a complete basis, then CHECKCONSISTENCY(H)

1. returns yes, if H is consistent,

2. returns no and $\rho \in I^*$, if $\rho$ leads to a conflict ($\delta^T(q_0^T, \rho) \# \delta^H(q_0^H, \rho)$ in T).
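The breadth-first consistency check can be sketched in Python as follows. The names `check_consistency`, `apart`, `TREE`, `ACCESS` and `HYP` are hypothetical, and the example tree is made up for this illustration; it is not the tree of any figure in the paper.

```python
from collections import deque


def apart(tree, q, p):
    """True if some input word is defined from both states with different outputs."""
    for i, (o, q_next) in tree[q].items():
        if i in tree[p]:
            o2, p_next = tree[p][i]
            if o != o2 or apart(tree, q_next, p_next):
                return True
    return False


def check_consistency(tree, access, hyp, root):
    """Walk the tree and the hypothesis in lockstep; hypothesis states are basis
    states of the tree, so apartness can be evaluated inside the tree."""
    queue = deque([(root, root)])
    while queue:
        q, r = queue.popleft()
        if apart(tree, q, r):
            return False, access[q]          # access(q) leads to a conflict
        for i, (_, p) in tree[q].items():
            queue.append((p, hyp[(r, i)][1]))
    return True, None


# Basis {t0, t1}; frontier {f1, f2, f3}; x and y lie beyond the frontier.
TREE = {
    "t0": {"a": ("0", "t1"), "b": ("0", "f1")},
    "t1": {"a": ("1", "f2"), "b": ("0", "f3")},
    "f1": {"b": ("0", "x")},
    "f2": {}, "f3": {},
    "x": {"a": ("1", "y")},
    "y": {},
}
ACCESS = {"t0": (), "t1": ("a",), "f1": ("b",), "f2": ("a", "a"),
          "f3": ("a", "b"), "x": ("b", "b"), "y": ("b", "b", "a")}
# Hypothesis that maps every frontier state back to t0.
HYP = {("t0", "a"): ("0", "t1"), ("t0", "b"): ("0", "t0"),
       ("t1", "a"): ("1", "t0"), ("t1", "b"): ("0", "t0")}
```

Here no frontier state is apart from `t0`, so `HYP` is a hypothesis, yet the deeper state `x` (reached by `bb`) outputs 1 on `a` while `t0` outputs 0. The check therefore reports the conflict `bb` without asking the teacher anything.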


Algorithm 2 Check if hypothesis H is consistent with observation tree T

procedure CHECKCONSISTENCY(H)
    Q ← new queue $\subseteq Q^T \times S$
    enqueue(Q, ($q_0^T$, $q_0^H$))
    while (q, r) ← dequeue(Q)
        if $q \# r$ then: return no: access(q)
        for all $q \xrightarrow{i/o} p$ in T do
            enqueue(Q, (p, $\delta^H(r, i)$))
        end for
    end while
    return yes
end procedure


3.4 Counterexample processing


The L* algorithm [5] performs O(m) queries to analyze a counterexample of
length m. So if a teacher returns really long counterexamples, their analysis will
dominate the learning process. Rivest & Schapire [52,53] improve counterexample
analysis of L* using binary search, requiring only O(log m) queries. A similar
trick is applied in L#.

Suppose $\sigma$ leads to a conflict $q \# r$ for $q = \delta^H(q_0^H, \sigma)$ and $r = \delta^T(q_0^T, \sigma)$.
Then, PROCCOUNTEREX($\sigma$) (Algorithm 3) extends T such that H will never be
a hypothesis for T again.

If $r \in S \cup F$, then the conflict $q \# r$ is obvious and H is not a hypothesis
anymore. If otherwise $r \notin S \cup F$, the binary search will successively reduce the
number of transitions of $\sigma$ outside $S \cup F$ by a factor of 2 until we reach the
above base case $r \in S \cup F$. Let $\sigma_1 \sigma_2 := \sigma$ such that the run of $\sigma_1$ in T ends halfway
between the frontier and r. By an additional output query, the binary search
checks whether already $\sigma_1$ leads to a conflict. In the two cases, we can either
avoid $\sigma_1$ or $\sigma_2$, so we reduce the number of transitions outside $S \cup F$ to half the
amount. The precise argument is in:


Lemma 3.11. Suppose basis S is complete, H is a complete Mealy machine
containing the basis, and $\sigma \in I^*$ leads to a conflict. Then PROCCOUNTEREX(H, $\sigma$)
terminates, performs at most $O(\log_2 |\sigma|)$ output queries and is correct: upon
termination, the machine H is not a hypothesis for T anymore.


3.5 Adaptive distinguishing sequences 


As an optimization in practice, we may extend the rules (R2) and (R3) by 
incorporating adaptive distinguishing sequences (ADS) into the respective output 
queries. Adaptive distinguishing sequences, which are commonly used in the 
area of conformance testing [39], are input sequences where the choice of an 
input may depend on the outputs received in response to previous inputs. Thus, 
strictly speaking, an ADS is a decision graph rather than a sequence. This mild 
extension of the learning framework reflects the actual black box behaviour of 
Mealy machines: for every input in I sent to the hidden Mealy machine, the
learner observes the output in O before sending the next input symbol. Use of
adaptive distinguishing sequences may reduce the number of output queries that 
are required for the identification of frontier states. 
Algorithm 3 Processing $\sigma$ that leads to a conflict, i.e. $\delta^H(q_0^H, \sigma) \# \delta^T(q_0^T, \sigma)$

procedure PROCCOUNTEREX(H, $\sigma \in I^*$)
    q ← $\delta^H(q_0^H, \sigma)$
    r ← $\delta^T(q_0^T, \sigma)$
    if $r \in S \cup F$ then
        return
    else
        $\rho$ ← unique prefix of $\sigma$ with $\delta^T(q_0^T, \rho) \in F$
        h ← $\lfloor (|\rho| + |\sigma|)/2 \rfloor$
        $\sigma_1$ ← $\sigma[1..h]$
        $\sigma_2$ ← $\sigma[h+1..|\sigma|]$
        q' ← $\delta^H(q_0^H, \sigma_1)$
        r' ← $\delta^T(q_0^T, \sigma_1)$
        $\eta$ ← witness for $q \# r$
        OUTPUTQUERY(access(q') $\sigma_2$ $\eta$)
        if $q' \# r'$ then
            PROCCOUNTEREX(H, $\sigma_1$)
        else
            PROCCOUNTEREX(H, access(q') $\sigma_2$)
        end if
    end if
end procedure


As an example, consider the observation tree of Figure 3(left). The basis for
this tree consists of 5 states, which are pairwise apart (separating sequences are
a, ab and aa). Frontier states can be identified by the single adaptive sequence
of Figure 3(right). The ADS starts with input a. If the response is 2 we have
identified our frontier state as $t_4$. If the response is 0 then the frontier state
is either $t_0$ or $t_2$, and we may identify the state with a subsequent input a.
Similarly, if the response is 1 then the frontier state is either $t_1$ or $t_3$, and we
may identify the state by a subsequent input b. We can therefore identify (or
isolate) frontier state $t_5$ with a single (extended) output query that starts with
the access sequence for $t_5$ (bbbba) followed by the ADS of Figure 3(right). If we
used separating sequences, we would need at least 2 output queries.

In the setting of L#, we can directly compute an optimal ADS from the
current observation tree. To this end, we recursively define an expected reward
function E, which sends a set $U \subseteq Q^T$ of states to the maximal expected number
of apartness pairs (in the absence of unexpected outputs):

$$E(U) = \max_{i \in \mathrm{inp}(U)} \left( \sum_{o \in O} \frac{|U \xrightarrow{i/o}| \cdot \bigl(|U^i_\downarrow| - |U \xrightarrow{i/o}| + E(U \xrightarrow{i/o})\bigr)}{|U^i_\downarrow|} \right) \qquad (2)$$

where $\mathrm{inp}(U) := \{i \in I \mid \exists q \in U \colon \delta^T(q,i){\downarrow}\}$, $U^i_\downarrow := \{q \in U \mid \delta^T(q,i){\downarrow}\}$ and
$U \xrightarrow{i/o} := \{q' \in Q^T \mid \exists q \in U \colon q \xrightarrow{i/o} q'\}$. We define the maximum over the empty
set to be 0. Then ADS(U) is the decision tree constructed as follows:

— If $\mathrm{inp}(U) = \emptyset$ then ADS(U) consists of a single node U without a label.
— If $\mathrm{inp}(U) \neq \emptyset$ then ADS(U) is constructed by choosing an input i that witnesses
the maximum E(U), creating a node U with label i, and, for each output o
with $U \xrightarrow{i/o} \neq \emptyset$, adding an o-transition to ADS($U \xrightarrow{i/o}$).

Fig. 3: An observation tree (left) and an ADS for its basis (right)

For the observation tree of Figure 3(left) we may compute $E(\{t_0, \ldots, t_4\}) = 4$
and obtain the decision tree of Figure 3(right) as ADS. Running the ADS from
state $t_5$ will create 4 new apartness pairs with basis states (or 5 in case an
unexpected output occurs, e.g. a(1)b(2)).
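The recursion for the expected reward can be sketched directly in Python. The function name `expected_reward` and the example `TREE` are assumptions for this illustration: the tree is a made-up fragment that merely reproduces the response pattern of the example (two states answer 0 on a, two answer 1, one answers 2), not the actual tree of Figure 3. The numerator follows the reading in which a state is separated from all other states of U that have the chosen input defined.

```python
from fractions import Fraction


def expected_reward(tree, U):
    """Maximal expected number of apartness pairs gained from the state set U.

    tree: state -> input -> (output, successor).
    """
    best = Fraction(0)                          # maximum over the empty set is 0
    inputs = {i for q in U for i in tree[q]}
    for i in inputs:
        defined = [q for q in U if i in tree[q]]
        by_output = {}                          # output -> set of successors
        for q in defined:
            o, nxt = tree[q][i]
            by_output.setdefault(o, set()).add(nxt)
        total = sum(
            len(V) * (len(defined) - len(V) + expected_reward(tree, V))
            for V in by_output.values()
        )
        best = max(best, total / len(defined))
    return best


# Hypothetical fragment: on input a, s0 and s2 answer 0, s1 and s3 answer 1,
# s4 answers 2; the remaining pairs are split by a second input (a or b).
TREE = {
    "s0": {"a": ("0", "u0")}, "s2": {"a": ("0", "u2")},
    "s1": {"a": ("1", "u1")}, "s3": {"a": ("1", "u3")},
    "s4": {"a": ("2", "u4")},
    "u0": {"a": ("0", "v0")}, "u2": {"a": ("1", "v2")},
    "u1": {"b": ("0", "v1")}, "u3": {"b": ("1", "v3")},
    "u4": {}, "v0": {}, "v1": {}, "v2": {}, "v3": {},
}
```

For this fragment the expected reward of the five-state set is (2·(5−2+1) + 2·(5−2+1) + 1·(5−1+0))/5 = 4, the same value as in the example above. The plain recursion can revisit the same state set many times; caching results per set would be a natural refinement.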


Proposition 3.12. Define $L^{\#}_{\mathrm{ADS}}$ by replacing the output queries in L# with


(R2’) OUTPUTQUERY(access(q) i ADS(S)) in (R2) and
(R3’) OUTPUTQUERY(access(q) ADS($\{b \in S \mid \neg(b \# q)\}$)) in (R3).


Then, $L^{\#}_{\mathrm{ADS}}$ lets the norm N(T) grow for each rule application and thus is correct.


3.6 Complexity 


Since equivalence queries are costly in practice and since processing of long 
counterexamples of length m requires O(log m) output queries, it makes sense to 
postpone equivalence queries as long as possible: 


Definition 3.13. Strategic L# (resp. $L^{\#}_{\mathrm{ADS}}$) is the special case of Algorithm 1
where rule (R4) is only applied if none of the other rules is applicable.


Then we obtain the following query complexity for the L# algorithm.


Theorem 3.14. Strategic L# (resp. $L^{\#}_{\mathrm{ADS}}$) learns the correct Mealy machine
within $O(kn^2 + n \log m)$ output queries and at most n − 1 equivalence queries.


The query complexity of L# equals the best known query complexity for 
active learning algorithms, as achieved by Rivest & Schapire’s algorithm [52,53], 
the observation pack algorithm [32], the TTT algorithm [37,36], and the ADT 
algorithm [25]. 
In a black box learning setting in practice, the cost of answering an output query for
$\sigma \in I^*$ grows linearly with the length of $\sigma$. Therefore, the (asymptotic) total
number of input symbols sent by the learner is also a metric for comparing
learning algorithms:


Theorem 3.15. Let $n \in O(m)$. Then the strategic L# algorithm learns the
correct Mealy machine with $O(kmn^2 + nm \log m)$ input symbols.


This matches the asymptotic symbol complexity of the best known active 
learning algorithms. Although PROCCOUNTEREX reduces the length of the 
sequence leading to the conflict, the witness of the conflict remains of size O(m) 
in the worst case. This means that we need $O(m \log m)$ symbols to process a
single counterexample and $O(nm \log m)$ symbols to process all counterexamples.
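One way to read the symbol bound of Theorem 3.15 is as the query bound of Theorem 3.14 multiplied by a per-query length. The derivation below is a sketch under our own simplifying assumption that every output query has length $O(n + m)$, which is $O(m)$ when $n \in O(m)$; the paper's proof may account for the individual query types more finely.

```latex
\underbrace{O(kn^2 + n \log m)}_{\text{output queries (Thm.\ 3.14)}}
\;\cdot\;
\underbrace{O(m)}_{\text{symbols per query (assumed)}}
\;=\; O(kmn^2 + nm \log m)
```

The second term is the $O(nm \log m)$ counterexample-processing cost discussed above: at most n − 1 counterexamples, each needing $O(\log m)$ queries of length $O(m)$.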


4 Experimental Evaluation 


In the previous sections, we have introduced and discussed the L# algorithm. We
now present a short experimental evaluation of the algorithm to demonstrate its
performance when compared to other state-of-the-art algorithms. We run two versions
of L#: the base version (Algorithm 1), and the ADS optimised variant ($L^{\#}_{\mathrm{ADS}}$), and
compare these with the (highly optimized) LearnLib⁴ implementations of TTT,
ADT,⁵ and ‘RS’, by which we refer to L* with Rivest-Schapire counterexample
processing [52,53]. All source code and data are available online.⁶


Implementing EQUIVQUERY: We implement equivalence queries using
conformance testing, which also makes output queries. We have fixed the testing tool
to Hybrid-ADS⁷ [57]. Hybrid-ADS has multiple configuration options, and we
have set the state cover mode to “buggy”, the number of extra states to check for
to 10, the number of infix symbols to 10, and the mode of execution to “random”,
generating an infinite test-suite. Note that with these settings, the equivalence
queries are not exact in general but approximated via random testing.


Data-set and metrics: We use a subset of the models available from the
AutomataWiki (see [47]): we learn models for the SSH, TCP, and TLS protocols,
alongside the BankCard models. The largest model in this subset has 66 states
and 13 input symbols. We record the number of output queries and input symbols
used during learning and testing, alongside the number of equivalence queries
required to learn each model. An output query is a sequence $\sigma \in I^*$ of $|\sigma|$ input
symbols and one reset symbol. A reset symbol returns the system under test
(SUT) to its initial state. So resets denotes the number of output queries and
inputs denotes the total number of symbols sent to the SUT. We believe that
these metrics accurately portray the effort required to learn a model.
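The two metrics can be computed from a log of output queries as in the sketch below. The function name `effort` and the log format are hypothetical, and we assume here that inputs counts input symbols only, with the one reset per query reported separately.

```python
from typing import Iterable, Sequence, Tuple


def effort(queries: Iterable[Sequence[str]]) -> Tuple[int, int]:
    """Return (resets, inputs) for a log of output queries.

    Each query costs one reset plus one symbol per input letter.
    """
    resets = 0
    inputs = 0
    for query in queries:
        resets += 1           # the SUT is reset once per output query
        inputs += len(query)  # every input symbol is sent to the SUT
    return resets, inputs
```

For example, the log `["ab", "a", ""]` yields 3 resets and 3 input symbols.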


4 https://learnlib.de/

5 The ADT algorithm makes use of some heuristics to guide the learning process; we
have selected the “Best-Effort” settings.

6 https://gitlab.science.ru.nl/sws/lsharp and 10.5281/zenodo.5735533

7 https://github.com/Jaxan/hybrid-ads
Experiment Set-up: All experiments were run on a Ryzen 3700X processor with
32GB of memory, running Linux. Each experiment refers to completely learning a
model of the SUT. Due to the effects of randomization in the equivalence oracle,
we repeat each experiment 100 times.

Fig. 4: Performance plots of the selected learning algorithms (lower is better). Panels: (a) symbols used during the learning phase; (b) symbols used during both learning and testing.


Results and Discussion: Fig. 4a shows the total size of data sent by the learning
algorithms via output queries, so both the number and the size of output queries
are counted. In order to incorporate the equivalence queries, Fig. 4b shows the
total size of data sent to the SUT during learning and testing. Note that in both
plots the y-axis is log-scaled. The x-axis indicates the models, sorted by increasing
number of states. The bars indicate standard deviation.

We can observe from the learning phase plot (Fig. 4a) that L# expectedly does
not perform better than the TTT and ADT algorithms, while the RS algorithm
performs the worst among all four. However, $L^{\#}_{\mathrm{ADS}}$ usually performs better than,
or at least is competitive with, ADT and TTT. Furthermore, the error bars in
the learning phase are very small, indicating that the measurements are stable.
Generally, depending on the models a different algorithm is the fastest, but for
every model, $L^{\#}_{\mathrm{ADS}}$ is among the fastest, with and without the exclusion of the
testing phase.

Fig. 4b presents the total number of input symbols and resets sent to the SUT. 
All algorithms seem to be very close in performance, which may be explained by 
the testing phase dominating the process. Indeed, Aslam et al. [8] experimentally 
demonstrated that it is largely the testing phase which influences learning effort. 

The complete benchmark results (in the appendix of [65]) show more detailed
information on the learned models, and highlight the smallest number per column
and model. We can see that the number of equivalence queries is roughly similar
for almost all the algorithms, while L# seems to perform better for some models
in the learning phase.


5 Conclusions and Future Work


We presented L#, a new algorithm for the classical problem of active automata
learning. The key idea behind the approach is to focus on establishing apartness,
or inequivalence of states, instead of approximating equivalence as in L* and its
descendants. Concretely, the table/discrimination tree in L*-like algorithms is
replaced in L# by an observation tree, together with an apartness relation. This
change in perspective leads to a simple but effective algorithm, which reduces the
total number of symbols required for learning when compared to state-of-the-art
algorithms. In particular, the use of observation trees, which are essentially
tree-shaped Mealy machines, enables a modular integration of testing techniques,
such as the ADS method, to identify states. Although the asymptotic output
query complexity of L# is $O(kn^2 + n \log m)$, in our experiments L# only needs
between kn and 4kn output queries (resets) to learn the benchmark models
(with n ≤ 66), which means that on average L# needs between 1 and 4 output
queries to identify a frontier state.

Of course there are also similarities between L# and L*. The basis of L# is
comparable to the top half of the L* table: both in L# and in ([53]’s version of)
L* these prefixes induce a spanning tree. The frontier of L# is comparable to
the bottom half of the L* table. But whereas L* constructs residual classes of
the language, L# builds an automaton directly from the observation tree. As a
consequence, L* asks redundant queries, and optimizations of L* try to avoid
this redundancy. In contrast, L# does not even think about asking redundant
queries since it operates directly on the observation tree and only poses queries
that increase the norm.

There is still much work to do to improve our prototype implementation, to 
include additional conformance testing algorithms, and to extend the experimental 
evaluation to a richer set of benchmarks and algorithms. One issue that we need 
to address is scaling of L# to bigger models. Our prototype implementation 
easily learns Mealy machines with hundreds of states, but fails to learn larger 
models such as the ESM benchmark of [57] (3410 states, 78 inputs) because the 
observation tree becomes too big (25 million nodes will be required for the 
ESM). We see several ways to address this issue, e.g., pruning the observation 
tree, only keeping short ADSs to separate the basis states, storing parts of the 
tree on disk, distributing the tree over multiple processors (parallelizing the 
learning process), and using existing platforms for big graph processing [54]. 

Aslam et al. [9] report on experiments in which active learning techniques
are applied to 202 industrial software components from ASML. Out of these,
interface protocols could be successfully derived for 134 components (within a
given time bound). One of the main conclusions of the study is that the equivalence
checking phase (i.e. conformance testing of hypothesis models) is the bottleneck
for scalability in industry. We believe that a tighter integration of learning and
testing, as enabled by L#, will be key to address this challenging problem.

It will be interesting to extend L# to richer frameworks such as register
automata, symbolic automata and weighted automata. In fact, we discovered L#
while working on a grey-box learning algorithm for symbolic automata.
Abstract. We present a new learning algorithm for realtime one-counter
automata. Our algorithm uses membership and equivalence queries as
in Angluin’s L* algorithm, as well as counter value queries and partial
equivalence queries. In a partial equivalence query, we ask the teacher
whether the language of a given finite-state automaton coincides with a
counter-bounded subset of the target language. We evaluate an
implementation of our algorithm on a number of random benchmarks and on
a use case regarding efficient JSON-stream validation.


Keywords: Realtime one-counter automata · Active learning


1 Introduction 


In active learning, a learner has to infer a model of an unknown machine by
interacting with a teacher. Angluin’s seminal L* algorithm does precisely this for
finite-state automata while using only membership and equivalence queries [4].
An important application of active learning is to learn black-box models from
(legacy) software and hardware systems [17,28]. Though recent works have greatly
advanced the state of the art in finite-state automata learning, handling
real-world applications usually involves tailor-made abstractions to circumvent
elements of the system which result in an infinite state space [1]. This highlights
the need for learning algorithms that focus on more expressive models.

One-counter automata (OCAs) are obtained by extending finite-state
automata with an integer-valued variable that can be increased, decreased, and
tested for equality against zero. The counter allows OCAs to capture the
behavior of some infinite-state systems. Additionally, their expressiveness has been
shown sufficient to verify programs with lists [10] and validate XML streams [13].
To the best of our knowledge, there is no learning algorithm for general OCAs.
For visibly OCAs (that is, when the alphabet is such that letters determine
whether the counter is decreased, increased, or not affected), Neider and Löding
describe an algorithm in [27]. Besides the usual membership and equivalence
queries, they use partial equivalence queries: given a finite-state automaton A
and a bound k, does the language of A correspond to the k-bounded subset of
the target language? Additionally, Fahmy and Roos [15] claim to have solved the
case of realtime OCA (i.e., when the automaton is assumed to be
configuration-deterministic and no ε-transitions are allowed). However, we were unable to
understand the algorithm and proofs in that paper due to lack of precise
formalization and detailed proofs. We also found an example where the provided
algorithm did not produce the expected results. It is noteworthy that Böhm et
al. [8] made similar remarks about related works of Roos [6,30].


Our contribution. We present a new learning algorithm for realtime one-counter 
automata (ROCAs). Our algorithm uses membership, equivalence and partial 
equivalence queries. It also makes use of counter value queries. That is, we make 
the assumption that we have an executable black box with observable counter 
values. We prove that our algorithm runs in exponential time and space and that 
it uses at most an exponential number of queries. Due to lack of space, some 
proofs have been omitted. We refer the interested reader to the full technical 
report of this work [12]. 

In [9], Bollig establishes a connection between OCAs with counter-value
observability and visibly OCAs. We expose a similar connection and are thus able
to leverage Neider and Löding’s learning algorithm for visibly one-counter
languages [27] as a sort of sub-routine for ours. Nevertheless, our learning algorithm
is more sophisticated due to the fact that the counter values cannot be inferred 
from a given word. Technically, the latter required us to extend the classical def- 
inition of observation tables as used in, e.g., [4,27]. Entries in our tables are com- 
posed of Boolean language information as well as a counter value or a wildcard 
encoding the fact that we do not (yet) care about the value of the corresponding 
word. (Our use of wildcards is reminiscent of the work [25] on learning a regular 
language from an “inexperienced” teacher.) Moreover our tables need two sets 
of suffixes while only one is necessary in classical tables. Indeed we provide an 
example showing that making a table closed and consistent leads to an infinite 
loop when it has only one set of suffixes. Due to these extensions, much work is 
required to prove that it is always possible to make a table closed and consis- 
tent in finite time. Finally, we formulate queries for the teacher in a way which 
ensures the observation table eventually induces a right congruence refining the 
classical Myhill-Nerode congruence with counter-value information. Our com- 
putations and experiments show that the second set of suffixes is exponential 
leading to an exponential algorithm (instead of polynomial as in [27]). 

We evaluate an implementation of our algorithm on random benchmarks and 
a use case inspired by [13]. Namely, we learn an ROCA model for a simple JSON 
schema validator — i.e., a program that verifies whether a JSON document 
satisfies a given JSON schema. The advantage of having a finite-state model of 
such a validator is that JSON-stream validation becomes trivially efficient (cf. 
automata-based parsing [3]). 


Related work. Our assumption about counter-value observability means that
the system with which we interact is a gray box. Several recent works make
such assumptions to learn complex languages or ameliorate query-usage bounds.
For instance, in [7], the authors assume they have information about the target 
language L in the form of a superset of it. Similarly, in [2], the authors assume 
L is obtained as the composition of two languages, one of which they know 
in advance. In [26], the teacher is assumed to have an executable automaton 
representation of the (infinite-word) target language and that some properties 
of this automaton are visible to the learner. Finally, in [16] it is assumed that 
constraints satisfied along the run of a system can be made visible. 


2 Preliminaries 


In this section we recall all necessary notions. We give a definition of realtime 
one-counter automaton adapted from [15,34]. We present the concept of behavior 
graph of such automata, inspired by the one given in [27] for visibly one-counter 
automata (VCAs), and state some important properties for our learning task. 

An alphabet $\Sigma$ is a non-empty finite set of symbols. A word is a finite sequence
of symbols from $\Sigma$, and the empty word is denoted by $\varepsilon$. The set of all words
over $\Sigma$ is denoted by $\Sigma^*$. The concatenation of two words $u, v \in \Sigma^*$ is denoted
by $uv$. A language $L$ is a subset of $\Sigma^*$. Given a word $w \in \Sigma^*$ and a language
$L \subseteq \Sigma^*$, the set of prefixes of $w$ is
$\mathrm{Pref}(w) = \{u \in \Sigma^* \mid \exists v \in \Sigma^*, w = uv\}$ and
the set of prefixes of $L$ is $\mathrm{Pref}(L) = \bigcup_{w \in L} \mathrm{Pref}(w)$. Similarly, we have the sets of
suffixes $\mathrm{Suff}(w) = \{u \in \Sigma^* \mid \exists v \in \Sigma^*, w = vu\}$ and
$\mathrm{Suff}(L) = \bigcup_{w \in L} \mathrm{Suff}(w)$.
Moreover, $L$ is said to be prefix-closed (resp. suffix-closed) if $L = \mathrm{Pref}(L)$ (resp.
$L = \mathrm{Suff}(L)$). In this paper, we always work with non-empty languages $L$ to
avoid having to treat particular cases.
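The prefix and suffix sets defined above are easy to experiment with on finite languages. The following is a minimal Python sketch; the function names are ours and are not part of the paper, and words are represented as plain strings.

```python
def prefixes(word):
    """Pref(w): every u such that w = uv for some v (includes '' and w)."""
    return {word[:i] for i in range(len(word) + 1)}


def suffixes(word):
    """Suff(w): every u such that w = vu for some v (includes '' and w)."""
    return {word[i:] for i in range(len(word) + 1)}


def prefix_closure(language):
    """Pref(L) for a finite language given as an iterable of strings."""
    return set().union(*(prefixes(w) for w in language))


def is_prefix_closed(language):
    """L is prefix-closed when L = Pref(L)."""
    return set(language) == prefix_closure(language)
```

For instance, `{"", "a", "ab"}` is prefix-closed, whereas `{"ab"}` is not, since it lacks the prefixes `""` and `"a"`.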


Definition 1. A realtime one-counter automaton (ROCA) $\mathcal{A}$ is a tuple
$\mathcal{A} = (Q, \Sigma, \delta_{=0}, \delta_{>0}, q_0, F)$ where: (1) $\Sigma$ is an alphabet, (2) $Q$ is a non-empty finite
set of states, (3) $q_0 \in Q$ is the initial state, (4) $F \subseteq Q$ is the set of final states,
and (5) $\delta_{=0}$ and $\delta_{>0}$ are two (total) transition functions defined as
$\delta_{=0} : Q \times \Sigma \to Q \times \{0, +1\}$ and $\delta_{>0} : Q \times \Sigma \to Q \times \{-1, 0, +1\}$.


The second component of the output of $\delta_{=0}$ and $\delta_{>0}$ gives the counter operation
to apply when taking the transition. Notice that it is impossible to decrement
the counter when it is already equal to zero.

A configuration is a pair $(q, n) \in Q \times \mathbb{N}$, that is, it contains the current state
and the current counter value. The transition relation
$\to_{\mathcal{A}} \subseteq (Q \times \mathbb{N}) \times \Sigma \times (Q \times \mathbb{N})$
contains $(q, n) \xrightarrow{a}_{\mathcal{A}} (p, m)$ if and only if

$$\begin{cases}
\delta_{=0}(q, a) = (p, c) \land m = n + c & \text{if } n = 0, \\
\delta_{>0}(q, a) = (p, c) \land m = n + c & \text{if } n > 0.
\end{cases}$$

When the context is clear, we omit $\mathcal{A}$ to simply write $\xrightarrow{a}$. We lift the relation to
words in the natural way. Notice that this relation is deterministic in the sense
that given a configuration $(q, n)$ and a word $w$, there exists a unique configuration
$(p, m)$ such that $(q, n) \xrightarrow{w} (p, m)$.

Given a word $w$, let $(q_0, 0) \xrightarrow{w} (q, n)$ be the run on $w$. When $n = 0$ and
$q \in F$, we say that this run is accepting. The language accepted by $\mathcal{A}$ is the set
$\mathcal{L}(\mathcal{A}) = \{w \in \Sigma^* \mid (q_0, 0) \xrightarrow{w} (q, 0) \text{ with } q \in F\}$. If a language $L$ is accepted by
some ROCA, we say that $L$ is a realtime one-counter language (ROCL).

Given $w \in \Sigma^*$, we define the counter value of $w$ according to $\mathcal{A}$, noted $c_{\mathcal{A}}(w)$,
as the counter value $n$ of the configuration $(q, n)$ such that $(q_0, 0) \xrightarrow{w} (q, n)$. We
define the height of $w$ according to $\mathcal{A}$, noted $h_{\mathcal{A}}(w)$, as the maximal counter
value among the prefixes of $w$, i.e., $h_{\mathcal{A}}(w) = \max_{x \in \mathrm{Pref}(w)} c_{\mathcal{A}}(x)$.

We now introduce the concept of behavior graph of an ROCA $\mathcal{A}$, inspired
by the one given for VCAs in [27]. It is a (possibly infinite) automaton based on
the congruence relation $\equiv$ over $\Sigma^*$ such that $u \equiv v$ if and only if for all $w \in \Sigma^*$,
we have (1) $uw \in L \Leftrightarrow vw \in L$ and (2) $uw, vw \in \mathrm{Pref}(L) \Rightarrow c_{\mathcal{A}}(uw) = c_{\mathcal{A}}(vw)$.
The equivalence class of $u$ is denoted by $[u]_\equiv$. This relation $\equiv$ is a refinement
of the Myhill-Nerode relation [18]. Its second condition depends on $\mathcal{A}$ and is
limited to $\mathrm{Pref}(L)$ because even if $\mathcal{A}$ has different counter values for words not
in $\mathrm{Pref}(L)$, we still require all those words to be equivalent.


Definition 2. Let $\mathcal{A} = (Q, \Sigma, \delta_{=0}, \delta_{>0}, q_0, F)$ be an ROCA accepting $L \subseteq \Sigma^*$.
The behavior graph of $\mathcal{A}$ is the automaton $BG(\mathcal{A}) = (Q_\equiv, \Sigma, \delta_\equiv, q^0_\equiv, F_\equiv)$ where:
(1) $Q_\equiv = \{[u]_\equiv \mid u \in \mathrm{Pref}(L)\}$ is the set of states, (2) $q^0_\equiv = [\varepsilon]_\equiv$ is the initial
state, (3) $F_\equiv = \{[u]_\equiv \mid u \in L\}$ is the set of final states, (4) $\delta_\equiv : Q_\equiv \times \Sigma \to Q_\equiv$
is the transition function defined by: $\delta_\equiv([u]_\equiv, a) = [ua]_\equiv$, $\forall [u]_\equiv, [ua]_\equiv \in Q_\equiv$,
$\forall a \in \Sigma$.


Note that $Q_\equiv \neq \emptyset$ since $L \neq \emptyset$ by assumption. A straightforward induction
shows that $BG(\mathcal{A})$ accepts $L = \mathcal{L}(\mathcal{A})$. By definition, $BG(\mathcal{A})$ is trim (each state is
reachable and co-reachable), which implies that the transition function is partial.


Fig. 1: An ROCA.
Fig. 2: Behavior graph of the ROCA of Fig. 1 (an initial part followed by a repeating part).


Example 1. A 3-state ROCA $\mathcal{A}$ over $\Sigma = \{a, b\}$ is given in Figure 1. The initial
state $q_0$ is marked by a small arrow and the two final states $q_1$ and $q_2$ are double-circled.
The transitions give the input symbol, the condition on the counter
value, and the counter operation, in this order ($\delta_{=0}$ is indicated in blue while
$\delta_{>0}$ is indicated in green). The run on $w = aababaa$ is accepting since it ends
with the configuration $(q_2, 0)$. Moreover, $c_{\mathcal{A}}(w) = 0$ and $h_{\mathcal{A}}(w) = 2$. One can
verify that
$\mathcal{L}(\mathcal{A}) = \{w \in \{a, b\}^* \mid \exists n \geq 0, \exists k_1, \ldots, k_n \geq 0, \exists u \in \{a, b\}^*, w = a^n b (b^{k_1} a \cdots b^{k_n} a) u\}$.
The behavior graph $BG(\mathcal{A})$ of $\mathcal{A}$ is given in Figure 2. One
can check that $b \equiv abba$. Indeed, $\forall w \in \Sigma^*$, $bw \in L \Leftrightarrow abbaw \in L$. Moreover, $\forall w \in \Sigma^*$
such that $bw, abbaw \in \mathrm{Pref}(L)$, we have $c_{\mathcal{A}}(bw) = c_{\mathcal{A}}(abbaw)$. However,
$ab \not\equiv aab$ since $ab, aab \in \mathrm{Pref}(L)$ but $c_{\mathcal{A}}(ab) = 1 \neq c_{\mathcal{A}}(aab) = 2$.
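The run semantics of Definition 1 can be simulated directly. The Python sketch below is illustrative only: the class and its method names are ours, and, since Figure 1 is not reproduced in this text, the transition tables of `EXAMPLE_1` are a hypothetical reconstruction chosen to agree with the language, counter values, and final configuration stated in Example 1. They may differ from the automaton actually drawn in the figure.

```python
from dataclasses import dataclass


@dataclass
class ROCA:
    initial: str
    finals: set
    delta_zero: dict  # (state, letter) -> (state, op) with op in {0, +1}
    delta_pos: dict   # (state, letter) -> (state, op) with op in {-1, 0, +1}

    def configurations(self, word):
        """Return the configurations reached after each prefix of `word`."""
        q, n = self.initial, 0
        configs = [(q, n)]
        for letter in word:
            table = self.delta_zero if n == 0 else self.delta_pos
            q, op = table[(q, letter)]
            n += op
            configs.append((q, n))
        return configs

    def accepts(self, word):
        q, n = self.configurations(word)[-1]
        return n == 0 and q in self.finals

    def counter_value(self, word):
        return self.configurations(word)[-1][1]

    def height(self, word):
        return max(n for _, n in self.configurations(word))


# Hypothetical reconstruction of the 3-state ROCA of Example 1 (assumption).
EXAMPLE_1 = ROCA(
    initial="q0",
    finals={"q1", "q2"},
    delta_zero={
        ("q0", "a"): ("q0", +1), ("q0", "b"): ("q1", 0),
        ("q1", "a"): ("q2", 0), ("q1", "b"): ("q2", 0),
        ("q2", "a"): ("q2", 0), ("q2", "b"): ("q2", 0),
    },
    delta_pos={
        ("q0", "a"): ("q0", +1), ("q0", "b"): ("q1", 0),
        ("q1", "a"): ("q1", -1), ("q1", "b"): ("q1", 0),
        ("q2", "a"): ("q2", 0), ("q2", "b"): ("q2", 0),
    },
)
```

Under this reconstruction, the run on `aababaa` ends in `("q2", 0)` with height 2, and the counter values of `ab` and `aab` are 1 and 2, matching the values quoted in Example 1.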


We finally state two important properties of the behavior graph, useful for
the learning of ROCAs. We first establish that $BG(\mathcal{A})$ has a finite representation
that relies on the fact that it has an ultimately periodic structure (see Figure 2).
Let us introduce some notations. By definition of the states of $BG(\mathcal{A})$, all words
in the same class $[u]_\equiv$ have the same counter value. We thus define the level $\ell$ of
$BG(\mathcal{A})$ as the set of states with counter value $\ell$. One can easily check that each
level has a number of states bounded by $|Q|$. The minimal such bound is called
the width of $BG(\mathcal{A})$, and is denoted by $K$. This observation allows us to enumerate the
states in level $\ell$ using a mapping $\nu_\ell : \{[w]_\equiv \in Q_\equiv \mid c_{\mathcal{A}}(w) = \ell\} \to \{1, \ldots, K\}$.
Using these enumerations $\nu_\ell$, $\ell \in \mathbb{N}$, we can encode the transitions of $BG(\mathcal{A})$ as
a sequence of mappings $\tau_\ell : \{1, \ldots, K\} \times \Sigma \to \{1, \ldots, K\} \times \{-1, 0, +1\}$, with
$\ell \in \mathbb{N}$, as follows. For all $i \in \{1, \ldots, K\}$, $a \in \Sigma$, the mapping $\tau_\ell$ is defined as:

$$\tau_\ell(i, a) = \begin{cases}
(j, c) & \text{if } \exists [u]_\equiv, [ua]_\equiv \in Q_\equiv \text{ such that } c_{\mathcal{A}}(u) = \ell,\ c_{\mathcal{A}}(ua) = \ell + c,\ \nu_\ell([u]_\equiv) = i,\ \nu_{\ell + c}([ua]_\equiv) = j, \\
\text{undefined} & \text{otherwise.}
\end{cases}$$

In this way, the behavior graph can be encoded as the sequence $\alpha = \tau_0 \tau_1 \tau_2 \ldots$,
called a description of $BG(\mathcal{A})$. The following theorem states that there always
exists such a description which is periodic (see again Figure 2).


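Theorem 1. Let $\mathcal{A}$ be an ROCA, $BG(\mathcal{A}) = (Q_\equiv, \Sigma, \delta_\equiv, q^0_\equiv, F_\equiv)$ be the behavior
graph of $\mathcal{A}$, and $K$ be the width of $BG(\mathcal{A})$. Then, there exists a sequence
of enumerations $\nu_\ell : \{[u]_\equiv \in Q_\equiv \mid c_{\mathcal{A}}(u) = \ell\} \to \{1, \ldots, K\}$ such that the
corresponding description $\alpha$ of $BG(\mathcal{A})$ is an ultimately periodic word with offset
$m > 0$ and period $k > 0$, i.e., $\alpha = \tau_0 \ldots \tau_{m-1} (\tau_m \ldots \tau_{m+k-1})^\omega$.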


This theorem is the counterpart of a similar theorem given in [27] for VCAs. 
We get this theorem thanks to an isomorphism between the behavior graph of 
an ROCA A and that of a suitable VCA constructed from A. 
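An ultimately periodic description only needs its first m + k mappings to be stored. As a small illustration (the function name `tau_index` is ours, and a period of at least 1 is assumed, as in the statement of Theorem 1 above), the following Python sketch computes which stored mapping describes a given level:

```python
def tau_index(level, offset, period):
    """Position of tau_level inside the finite list
    [tau_0, ..., tau_{m-1}, tau_m, ..., tau_{m+k-1}]
    representing tau_0 ... tau_{m-1} (tau_m ... tau_{m+k-1})^omega,
    where m = offset and k = period."""
    if level < 0 or offset < 0 or period < 1:
        raise ValueError("expected level >= 0, offset >= 0 and period >= 1")
    if level < offset:
        return level                                # still in the initial part
    return offset + (level - offset) % period      # inside the repeating part
```

With offset 2 and period 3, levels 0 to 9 are described by the stored mappings 0, 1, 2, 3, 4, 2, 3, 4, 2, 3.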

The second major property states that from a periodic description of a behav- 
ior graph BG(A), one can construct an ROCA that accepts the same language. 


Proposition 1. Let $\mathcal{A}$ be an ROCA accepting a language $L \subseteq \Sigma^*$, $BG(\mathcal{A})$ be
its behavior graph of width $K$, $\alpha = \tau_0 \ldots \tau_{m-1} (\tau_m \ldots \tau_{m+k-1})^\omega$ be a periodic
description of $BG(\mathcal{A})$ with offset $m$ and period $k$. Then, from $\alpha$, one can construct
an ROCA $\mathcal{A}_\alpha$ accepting $L$ such that the size of $\mathcal{A}_\alpha$ is polynomial in $m$, $k$ and $K$.


3 Learning ROCAs


The aim of this paper is to design an ROCA-learning algorithm. We suppose that
the reader is familiar with the concept of active learning, and in particular with
Angluin’s seminal L* algorithm for learning finite-state automata (DFAs) [4]. In
this section, let us fix a language $L \subseteq \Sigma^*$ and an ROCA $\mathcal{A}$ such that $\mathcal{L}(\mathcal{A}) = L$.
We here explain how a learner will learn $L$ by querying a teacher. Our learning
algorithm is inspired by the one designed in [27] for VCAs. The idea is to learn
an initial fragment of the behavior graph $BG(\mathcal{A})$ up to a fixed counter limit $\ell$, to
extract every possible periodic description from the fragment, and to construct
an ROCA from each of these descriptions. If we find one ROCA accepting $L$, we
are done. Otherwise, we increase $\ell$ and repeat the process.

Formally, the initial fragment up to $\ell$ is called the limited behavior graph
$BG_\ell(\mathcal{A})$. This is the subgraph of $BG(\mathcal{A})$ whose set of states is
$\{[w]_\equiv \in Q_\equiv \mid h_{\mathcal{A}}(w) \leq \ell\}$. This DFA accepts the language
$L_\ell = \{w \in L \mid \forall x \in \mathrm{Pref}(w), 0 \leq c_{\mathcal{A}}(x) \leq \ell\}$.
Notice that $BG_\ell(\mathcal{A})$ is composed of the first $\ell + 1$ levels of $BG(\mathcal{A})$
(from $0$ to $\ell$) such that each level is restricted to states $[w]_\equiv$ with $h_{\mathcal{A}}(w) \leq \ell$.

During the learning process, the teacher has to answer four different types of
queries asked by the learner: (1) membership query (does $w \in \Sigma^*$ belong to $L$?),
(2) counter value query (given $w \in \mathrm{Pref}(L)$, what is $c_{\mathcal{A}}(w)$?), (3) partial
equivalence query (does the DFA $\mathcal{B}$ accept $L_\ell$?), and (4) equivalence query (does the
ROCA $\mathcal{B}$ accept $L$?). In case of a negative answer to a (partial) equivalence query,
the teacher provides a counterexample $w \in \Sigma^*$ witnessing this non-equivalence.

Recall that membership and equivalence queries are used in the L* algo- 
rithm [4]. Additionally, partial equivalence queries are required in the VCA- 
learning algorithm of [27] to find the basis of a periodic description for the target 
automaton. However counter-value queries are not necessary because VCAs use 
a pushdown alphabet and the counter value can be directly inferred from the 
word. For general alphabets, this is no longer possible and the learner has to ask 
the teacher for this information. Our main result is the following theorem. 


Theorem 2. Let $\mathcal{A}$ be an ROCA accepting a language $L \subseteq \Sigma^*$. Given a teacher
for $L$, which answers membership, counter value, and (partial) equivalence queries,
an ROCA accepting $L$ can be computed in time and space exponential in $|Q|$, $|\Sigma|$
and $t$, with $Q$ the set of states of $\mathcal{A}$ and $t$ the length of the longest
counterexample returned by the teacher on (partial) equivalence queries. The learner asks
$O(t^{?})$ partial equivalence queries, $O(|Q| t^{?})$ equivalence queries and a number of
membership and counter value queries exponential in $|Q|$, $|\Sigma|$ and $t$.


In what follows we describe the main steps of our learning algorithm. Given a
counter limit $\ell$, we first introduce the kind of observation table $\mathcal{O}_\ell$ we use to store
the learned information about $L_\ell$. Secondly, we explain which constraints
are imposed on $\mathcal{O}_\ell$ to derive a DFA $\mathcal{A}_{\mathcal{O}_\ell}$, candidate for accepting $L_\ell$. Thirdly, when
a counterexample is returned by the teacher to a partial equivalence query with
this DFA, we explain how to update the table. Fourthly, we give the whole
learning algorithm such that when $\mathcal{A}_{\mathcal{O}_\ell}$ accepts $L_\ell$ with $\ell$ big enough, a periodic
description $\alpha$ is finally extracted from $\mathcal{A}_{\mathcal{O}_\ell}$ such that the ROCA $\mathcal{A}_\alpha$ accepts $L$.


Observation Table and Approximation Sets. As for learning DFAs and
VCAs, we use an observation table to store the data gathered during the learning
process. This table aims at approximating the equivalence classes of $\equiv$ and
therefore stores information about both membership to $L$ and counter values for
words known to be in $\mathrm{Pref}(L)$. It depends on a counter limit $\ell \in \mathbb{N}$ since we
first want to learn $BG_\ell(\mathcal{A})$. We highlight the fact that our table uses two sets of
suffixes: $S$ and $\widehat{S}$ (contrarily to the algorithms of [4,27] that use only one set $S$).
Intuitively, we use the set $\widehat{S}$ to store membership information and the set $S$ for
counter value information. In [27], the set $\widehat{S}$ is not needed as the counter value
of a word can be immediately derived from the word. This is not the case for
us, as the teacher’s ROCA is required to compute the counter value of a word.
Therefore, we need to explicitly store that information.


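Definition 3. Let $\ell \in \mathbb{N}$ be a counter limit and $L_\ell$ be the language accepted by
the limited behavior graph $BG_\ell(\mathcal{A})$ of an ROCA $\mathcal{A}$. An observation table $\mathcal{O}_\ell$
up to $\ell$ is a tuple $(R, S, \widehat{S}, \mathcal{L}_\ell, \mathcal{C}_\ell)$ with: (1) a finite prefix-closed set $R \subseteq \Sigma^*$
of representatives, (2) two finite suffix-closed sets $S, \widehat{S}$ of separators such that
$S \subseteq \widehat{S} \subseteq \Sigma^*$, (3) a function $\mathcal{L}_\ell : (R \cup R\Sigma)\widehat{S} \to \{0, 1\}$, (4) a function
$\mathcal{C}_\ell : (R \cup R\Sigma)S \to \{\bot, 0, \ldots, \ell\}$.

Let $\mathrm{Pref}(\mathcal{O}_\ell)$ be the set $\{w \in \mathrm{Pref}(us) \mid u \in R \cup R\Sigma, s \in \widehat{S}, \mathcal{L}_\ell(us) = 1\}$.
Then for all $u \in R \cup R\Sigma$ the following holds:


- for all $s \in \widehat{S}$, $\mathcal{L}_\ell(us)$ is $1$ if $us \in L_\ell$ and $0$ otherwise,
- for all $s \in S$, $\mathcal{C}_\ell(us)$ is $c_{\mathcal{A}}(us)$ if $us \in \mathrm{Pref}(\mathcal{O}_\ell)$ and $\bot$ otherwise.


In this definition, the domains of $\mathcal{L}_\ell$ and $\mathcal{C}_\ell$ are different as already mentioned.
Notice that $\mathrm{Pref}(\mathcal{O}_\ell) \subseteq \mathrm{Pref}(L)$. To compute the values of the table $\mathcal{O}_\ell$, the
learner proceeds by asking membership and counter value queries to the teacher.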
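To make Definition 3 concrete, here is a hedged Python sketch of how the two functions of the table could be filled from the teacher's answers. Everything named here is ours: `fill_table`, `member_limited`, and `counter_value` are hypothetical helpers, `None` stands for the wildcard ⊥, and the teacher is simulated by a hard-coded 3-state automaton that we assume accepts the language of Example 1 (Figure 1 itself is not available in this text).

```python
BOT = None  # stands for the wildcard value (bottom)


def trace(word):
    """Final state and list of counter values along `word` in the assumed
    3-state ROCA for the language of Example 1 (states 0, 1, 2)."""
    q, n, counters = 0, 0, [0]
    for x in word:
        if q == 0:
            q, n = (0, n + 1) if x == "a" else (1, n)
        elif q == 1:
            if n == 0:
                q = 2          # counter is zero: move to the accepting sink
            elif x == "a":
                n -= 1         # counter is positive: an 'a' decrements it
        counters.append(n)
    return q, counters


def member_limited(word, limit):
    """Membership in L_l: accepted, with every prefix counter at most `limit`."""
    q, counters = trace(word)
    return q in (1, 2) and counters[-1] == 0 and max(counters) <= limit


def counter_value(word):
    return trace(word)[1][-1]


def fill_table(R, S, S_hat, alphabet, limit):
    """Return the two functions of the table as dictionaries keyed by words."""
    rows = sorted(set(R) | {u + a for u in R for a in alphabet})
    L = {u + s: int(member_limited(u + s, limit)) for u in rows for s in S_hat}
    # Pref(O_l): prefixes of the words of the table known to be in L_l.
    known = {w[:i] for w, bit in L.items() if bit for i in range(len(w) + 1)}
    C = {u + s: (counter_value(u + s) if u + s in known else BOT)
         for u in rows for s in S}
    return L, C
```

Note the design consequence this sketch exposes: the counter entry of a word depends on the larger suffix set, because a word only receives a counter value once some extension of it is known to be in the limited language.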


           | ε    | a  ba
    -------+------+-------
    ε      | 0,0  | 0  1
    a      | 0,1  | 0  1
    ab     | 0,1  | 1  1
    aba    | 1,0  | 1  1
    aa     | 0,⊥  | 0  0
    -------+------+-------
    b      | 1,0  | 1  1
    abb    | 0,1  | 1  1
    abaa   | 1,0  | 1  1
    abab   | 1,0  | 1  1
    aaa    | 0,⊥  | 0  0
    aab    | 0,⊥  | 0  0

Fig. 3: An observation table.

           | ε               | ε               | ε
    -------+------    -------+------    -------+------
    ε      | 0,0      ε      | 0,0      ε      | 0,0
    a      | 0,1      a      | 0,1      a      | 0,1
    ab     | 0,1      ab     | 0,1      ab     | 0,1
    aba    | 1,0      aba    | 1,0      aba    | 1,0
    -------+------    abb    | 0,1      abb    | 0,1
    b      | 1,0      -------+------    abbb   | 0,1
    aa     | 0,⊥      b      | 1,0      -------+------
    abb    | 0,⊥      aa     | 0,⊥      b      | 1,0
    abaa   | 1,0      abaa   | 1,0      aa     | 0,⊥
    abab   | 1,0      abab   | 1,0      abaa   | 1,0
                      abba   | 1,0      abab   | 1,0
                      abbb   | 0,⊥      abba   | 1,0
                                        abbba  | 1,0
                                        abbbb  | 0,⊥

Fig. 4: Observation tables exposing an infinite loop when using the L* algorithm.


Example 2. In this example, we give an observation table $\mathcal{O}_\ell$ for the ROCA $\mathcal{A}$
from Figure 1 and the counter limit $\ell = 1$, see Figure 3. Hence we want to learn
$BG_\ell(\mathcal{A})$ whose set of states is given by the first two levels from Figure 2.

The first column of $\mathcal{O}_\ell$ contains the elements of $R \cup R\Sigma$ such that the upper
part is constituted by $R = \{\varepsilon, a, ab, aba, aa\}$ and the lower part by $R\Sigma \setminus R$. The
first row contains the elements of $\widehat{S}$ such that the left part is constituted by
$S = \{\varepsilon\}$ and the right part by $\widehat{S} \setminus S$. For each element $us \in (R \cup R\Sigma)S$, we store
the two values $\mathcal{L}_\ell(us)$ and $\mathcal{C}_\ell(us)$ in the intersection of row $u$ and column $s$. For
instance, these values are equal to $0, \bot$ for $u = aa$ and $s = \varepsilon$. For each element
$us \in (R \cup R\Sigma)(\widehat{S} \setminus S)$, we have only one value $\mathcal{L}_\ell(us)$ to store. Note that $\mathrm{Pref}(\mathcal{O}_\ell)$
is a proper subset of $\mathrm{Pref}(L)$. For instance, $aababaa \in \mathrm{Pref}(L) \setminus \mathrm{Pref}(\mathcal{O}_\ell)$.

Let us now explain why it is necessary to use the additional set $\widehat{S}$ in
Definition 3. Assume that we only use the set $S$ and that the current observation
table is the leftmost table $\mathcal{O}_\ell$, with $\ell = 1$, given in Figure 4 for the ROCA from
Figure 1. On top of that, assume we are using the classical L* algorithm [4]. As
we can see, the table is not closed since the stored information for $abb$, that is,
$0, \bot$, does not appear in the upper part of the table for any $u \in R$. So, we add
$abb$ in this upper part and $abba$ and $abbb$ in the lower part, to obtain the second
table of Figure 4. Notice that this shift of $abb$ has changed its stored information,
which is now equal to $0, 1$. Indeed the set $\mathrm{Pref}(\mathcal{O}_\ell)$ now contains $abb$ as a prefix
of $abba \in L_\ell$. Again, the new table is not closed because of $abbb$. After shifting
this word in the upper part of the table, we obtain the third table of Figure 4.
It is still not closed due to $abbbb$. This process will continue ad infinitum.


To avoid an infinite loop when making the table closed, as described in the
previous example, we modify both the concept of table and how to derive an
equivalence relation from that table. Our solution is to introduce the set $\widehat{S}$, as
already explained, but also the concept of approximation set to approximate $\equiv$.


Definition 4. Let $\mathcal{O}_\ell = (R, S, \widehat{S}, \mathcal{L}_\ell, \mathcal{C}_\ell)$ be an observation table up to $\ell$. Let
$u, v \in R \cup R\Sigma$. Then, $u \in \mathrm{Approx}(v)$ if and only if for all $s \in S$, we have
$\mathcal{L}_\ell(us) = \mathcal{L}_\ell(vs)$ and $\mathcal{C}_\ell(us) \neq \bot \land \mathcal{C}_\ell(vs) \neq \bot \Rightarrow \mathcal{C}_\ell(us) = \mathcal{C}_\ell(vs)$. The set
$\mathrm{Approx}(v)$ is called an approximation set.


In this definition, note that we consider $\bot$ values as wildcards and we focus on
words with suffixes from $S$ only (and not from $\widehat{S} \setminus S$). Interestingly, such wildcard
entries in observation tables also feature in learning from an “inexperienced”
teacher [25]. Just like in that work, a crucial part of our learning algorithm
concerns how to obtain an equivalence relation from such an observation table
(note that $\mathrm{Approx}$ does not define an equivalence relation as it is not transitive).


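Example 3. Let $\mathcal{O}_\ell$ be the table from Figure 3. Let us compute $\mathrm{Approx}(\varepsilon)$ (recall
that we only consider $S = \{\varepsilon\}$ and not $\widehat{S} \setminus S = \{a, ba\}$). We can see that
$aba \notin \mathrm{Approx}(\varepsilon)$ as $\mathcal{L}_\ell(aba) = 1$ and $\mathcal{L}_\ell(\varepsilon) = 0$. Moreover, $a \notin \mathrm{Approx}(\varepsilon)$ since
$\mathcal{C}_\ell(a) \neq \bot$, $\mathcal{C}_\ell(\varepsilon) \neq \bot$, and $\mathcal{C}_\ell(a) \neq \mathcal{C}_\ell(\varepsilon)$. With the same arguments, we also discard
$ab, b, abb, abaa, abab$. Thus, $\mathrm{Approx}(\varepsilon) = \{\varepsilon, aa, aaa, aab\}$. On the other hand,
$\mathrm{Approx}(aa) = \{\varepsilon, a, ab, aa, abb, aaa, aab\}$ knowing that $\mathcal{C}_\ell(aa) = \bot$.

The following notation will be convenient later. Let $\mathcal{O}_\ell$ be an observation
table and $u \in R \cup R\Sigma$. If $\mathcal{C}_\ell(u) = \bot$ (which means that $u \notin \mathrm{Pref}(\mathcal{O}_\ell)$), then we
say that $u$ is a $\bot$-word. Let us denote by $\overline{R}$ the set $R$ from which the $\bot$-words
have been removed. We define $\overline{R \cup R\Sigma}$ in a similar way. We can now formalize
the relation between $\equiv$ and $\mathrm{Approx}$.


Proposition 2. Let $\mathcal{O}_\ell$ be an observation table up to $\ell \in \mathbb{N}$. Then for all
$u, v \in \overline{R \cup R\Sigma}$, we have $u \equiv v \Rightarrow u \in \mathrm{Approx}(v)$.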
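The approximation sets of Example 3 can be recomputed mechanically from the entries of Figure 3. In the Python sketch below, the dictionary `TABLE` transcribes the column for the empty suffix of that figure, `None` plays the role of ⊥, and the function names are ours.

```python
BOT = None  # wildcard value (bottom)

# Column of Fig. 3 for the suffix epsilon: word -> (membership bit, counter value).
TABLE = {
    "": (0, 0), "a": (0, 1), "ab": (0, 1), "aba": (1, 0), "aa": (0, BOT),
    "b": (1, 0), "abb": (0, 1), "abaa": (1, 0), "abab": (1, 0),
    "aaa": (0, BOT), "aab": (0, BOT),
}
ROWS = list(TABLE)   # R together with R.Sigma
S = [""]             # only the suffixes of S are compared, here {epsilon}


def compatible(x, y):
    """Same membership bit, and equal counter values unless one is a wildcard."""
    (lx, cx), (ly, cy) = TABLE[x], TABLE[y]
    return lx == ly and (cx is BOT or cy is BOT or cx == cy)


def approx(v):
    """Approx(v): rows u that agree with v on every suffix of S."""
    return {u for u in ROWS if all(compatible(u + s, v + s) for s in S)}
```

The computation also exhibits the lack of transitivity mentioned above: `a` belongs to `approx("aa")` and `aa` belongs to `approx("")`, yet `a` does not belong to `approx("")`.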


Closed and Consistent Observation Table. As for the L* algorithm [4],
we need to define the constraints a table $\mathcal{O}_\ell$ must respect in order to obtain a
congruence relation from $\mathrm{Approx}$ and then to construct a DFA. This is more
complex than for L*. Namely, the table must be closed, $\Sigma$-consistent, and
$\bot$-consistent. The first two constraints are close to the ones already imposed by
L*. The last one is new. Crucially, it implies that $\mathrm{Approx}$ is transitive.


Definition 5. Let $\mathcal{O}_\ell$ be an observation table up to $\ell \in \mathbb{N}$. We say the table is:
- closed if $\forall u \in R\Sigma$, $\mathrm{Approx}(u) \cap R \neq \emptyset$,
- $\Sigma$-consistent if $\forall u \in R$, $\forall a \in \Sigma$,
  $ua \in \bigcap_{v \in \mathrm{Approx}(u) \cap R} \mathrm{Approx}(va)$,
- $\bot$-consistent if $\forall u, v \in R \cup R\Sigma$ such that $u \in \mathrm{Approx}(v)$,
  $\forall s \in S$, $\mathcal{C}_\ell(us) \neq \bot \Leftrightarrow \mathcal{C}_\ell(vs) \neq \bot$.


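Example 4. Let $\mathcal{O}_\ell$ be the table from Figure 3. We have $\mathrm{Approx}(b) \cap R \neq \emptyset$
because $aba \in \mathrm{Approx}(b)$. More generally one can check that $\mathcal{O}_\ell$ is closed.
However, $\mathcal{O}_\ell$ is not $\Sigma$-consistent. Indeed, $\varepsilon b \notin \bigcap_{v \in \mathrm{Approx}(\varepsilon) \cap R} \mathrm{Approx}(vb)$ since
$\mathrm{Approx}(\varepsilon) \cap R = \{\varepsilon, aa\}$ and $\varepsilon b \notin \mathrm{Approx}(aab)$. Finally, $\mathcal{O}_\ell$ is also not
$\bot$-consistent since $aa \in \mathrm{Approx}(\varepsilon)$ but $\mathcal{C}_\ell(aa) = \bot$ and $\mathcal{C}_\ell(\varepsilon) = 0$.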
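The three constraints of Definition 5 translate into short checks. The Python sketch below (function names ours, `None` standing for ⊥) evaluates them on the entries of Figure 3 for the empty suffix, and reproduces the three claims of Example 4.

```python
BOT = None  # wildcard value (bottom)
SIGMA = "ab"
R = ["", "a", "ab", "aba", "aa"]
S = [""]

# Column of Fig. 3 for the suffix epsilon: word -> (membership bit, counter value).
TABLE = {
    "": (0, 0), "a": (0, 1), "ab": (0, 1), "aba": (1, 0), "aa": (0, BOT),
    "b": (1, 0), "abb": (0, 1), "abaa": (1, 0), "abab": (1, 0),
    "aaa": (0, BOT), "aab": (0, BOT),
}
R_SIGMA = {u + a for u in R for a in SIGMA}
ROWS = set(R) | R_SIGMA


def approx(v):
    def compatible(x, y):
        (lx, cx), (ly, cy) = TABLE[x], TABLE[y]
        return lx == ly and (cx is BOT or cy is BOT or cx == cy)
    return {u for u in ROWS if all(compatible(u + s, v + s) for s in S)}


def is_closed():
    # Every row of R.Sigma must be approximated by some representative.
    return all(approx(u) & set(R) for u in R_SIGMA)


def is_sigma_consistent():
    # Approximated representatives must stay approximated after reading a letter.
    return all(u + a in approx(v + a)
               for u in R for a in SIGMA for v in approx(u) & set(R))


def is_bot_consistent():
    # Rows approximating each other must carry wildcards in the same columns.
    return all((TABLE[u + s][1] is BOT) == (TABLE[v + s][1] is BOT)
               for v in ROWS for u in approx(v) for s in S)
```

On this table, `is_closed()` holds while `is_sigma_consistent()` and `is_bot_consistent()` both fail, as stated in Example 4.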


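When $\mathcal{O}_\ell$ is closed and consistent, we define the following relation $\equiv_{\mathcal{O}_\ell}$:
$\forall u, v \in R \cup R\Sigma$, $u \equiv_{\mathcal{O}_\ell} v \Leftrightarrow u \in \mathrm{Approx}(v)$. This relation is a congruence
over $R$ from which we can construct a DFA $\mathcal{A}_{\mathcal{O}_\ell}$.


Definition 6. Let $\mathcal{O}_\ell$ be a closed, $\Sigma$- and $\bot$-consistent observation table up
to $\ell$. From $\equiv_{\mathcal{O}_\ell}$, we define the DFA
$\mathcal{A}_{\mathcal{O}_\ell} = (Q_{\mathcal{O}_\ell}, \Sigma, \delta_{\mathcal{O}_\ell}, q^0_{\mathcal{O}_\ell}, F_{\mathcal{O}_\ell})$ with:
(1) $Q_{\mathcal{O}_\ell} = \{[u]_{\equiv_{\mathcal{O}_\ell}} \mid u \in R\}$,
(2) $q^0_{\mathcal{O}_\ell} = [\varepsilon]_{\equiv_{\mathcal{O}_\ell}}$,
(3) $F_{\mathcal{O}_\ell} = \{[u]_{\equiv_{\mathcal{O}_\ell}} \mid \mathcal{L}_\ell(u) = 1\}$,
and (4) the (total) transition function $\delta_{\mathcal{O}_\ell}$ is defined by
$\delta_{\mathcal{O}_\ell}([u]_{\equiv_{\mathcal{O}_\ell}}, a) = [ua]_{\equiv_{\mathcal{O}_\ell}}$,
for all $[u]_{\equiv_{\mathcal{O}_\ell}} \in Q_{\mathcal{O}_\ell}$ and $a \in \Sigma$.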

Note that $\mathcal{A}_{\mathcal{O}_\ell}$ is consistent with the information stored in $\mathcal{O}_\ell$.


Lemma 1. For all $u \in R \cup R\Sigma$, we have $u \in \mathcal{L}(\mathcal{A}_{\mathcal{O}_\ell}) \Leftrightarrow u \in L_\ell$.


Making a Table Closed and Consistent. Suppose we have an observation
table $\mathcal{O}_\ell$ up to $\ell$ and we want to make it closed, $\Sigma$- and $\bot$-consistent. We here
give some intuition on how to proceed.

If the table $\mathcal{O}_\ell$ is not closed or not $\Sigma$-consistent, we proceed as in the L*
algorithm [4]. In the first case, this means that $\exists u \in R\Sigma$, $\mathrm{Approx}(u) \cap R = \emptyset$.
It follows that $u \notin R$ and we thus add $u$ to $R$ and update the table. In the
second case, this means that $\exists ua \in R\Sigma$, $\exists v \in \mathrm{Approx}(u) \cap R$, $ua \notin \mathrm{Approx}(va)$.
We have two cases: there exists $s \in S$ such that either $\mathcal{L}_\ell(uas) \neq \mathcal{L}_\ell(vas)$, or
$\mathcal{C}_\ell(uas) \neq \bot \land \mathcal{C}_\ell(vas) \neq \bot \land \mathcal{C}_\ell(uas) \neq \mathcal{C}_\ell(vas)$. In both cases, we add $as$ to $S$
and to $\widehat{S}$ and we update the table.

Suppose that $\mathcal{O}_\ell$ is not $\bot$-consistent, i.e., $\exists u, v \in R \cup R\Sigma$, $\exists s \in S$,
$u \in \mathrm{Approx}(v)$ and $\mathcal{C}_\ell(us) \neq \bot \Leftrightarrow \mathcal{C}_\ell(vs) = \bot$. We call mismatch the latter
disequality. Let us assume, without loss of generality, that $\mathcal{C}_\ell(us) \neq \bot$ and $\mathcal{C}_\ell(vs) = \bot$.
So, $us \in \mathrm{Pref}(\mathcal{O}_\ell)$, i.e., there exist $u' \in R \cup R\Sigma$ and $s' \in \widehat{S}$ such that
$us \in \mathrm{Pref}(u's')$ and $\mathcal{L}_\ell(u's') = 1$. We denote by $s''$ the word such that $us'' = u's'$.
The idea is to add $\mathrm{Suff}(s'')$ to one or both sets $S, \widehat{S}$. We have two cases:


- Suppose $u'$ is a prefix of $u$. We have $s'' \in \widehat{S} \setminus S$ and add $\mathrm{Suff}(s'')$ to $S$.
- Suppose $u$ is a proper prefix of $u'$. If $vs'' \in L_\ell$ then we add $\mathrm{Suff}(s'')$ to $\widehat{S}$,
  otherwise we add $\mathrm{Suff}(s'')$ to both $S$ and $\widehat{S}$.


The difficult task is to prove that it is always possible to make a table closed 
and consistent in finite time. 


Proposition 3. Given an observation table $\mathcal{O}_\ell$ up to $\ell \in \mathbb{N}$, there exists an
algorithm that makes it closed, $\Sigma$- and $\bot$-consistent in a finite amount of time.


Let us give some rough intuition. Notice that $R$ increases only when the table
is not closed, and that $S, \widehat{S}$ may increase only when the table is not consistent.
Firstly, the number of times the table is not closed is bounded by the number of
classes of $\equiv$ up to counter limit $\ell$, by Proposition 2. Indeed, when $u \in R\Sigma \setminus R$,
witness that $\mathcal{O}_\ell$ is not closed, is added to $R$, then it becomes the only
representative in its new approximation set. Secondly, one can prove that after resolving
a case where the table is not consistent, then either the size of an approximation
set decreases or a mismatch is eliminated. The number of times an approximation
set may decrease is bounded, because there are at most $|R \cup R\Sigma|$ distinct
such sets whose size is bounded by $|R \cup R\Sigma|$. Finally, the number of mismatches
to eliminate is also bounded. Hard work was necessary to get this result as, when 
one mismatch is eliminated when solving a case where the table is not consistent, 
S may increase, inducing the creation of new mismatches. 


Handling Counterexamples to Partial Equivalence Queries. Let $\mathcal{A}_{\mathcal{O}_\ell}$ be
the DFA constructed from a closed, $\Sigma$- and $\bot$-consistent observation table $\mathcal{O}_\ell$.
If the teacher’s answer to a partial equivalence query over $\mathcal{A}_{\mathcal{O}_\ell}$ is positive, then
$\mathcal{A}_{\mathcal{O}_\ell}$ accepts exactly $L_\ell$. Otherwise, the teacher returns a counterexample, that
is, a word $w \in \Sigma^*$ such that $w \in L_\ell \Leftrightarrow w \notin \mathcal{L}(\mathcal{A}_{\mathcal{O}_\ell})$. In the latter case, we add
$\mathrm{Pref}(w)$ to $R$ and update the table. We finally make the new table $\mathcal{O}'_\ell$ closed,
$\Sigma$- and $\bot$-consistent. We have that $\equiv_{\mathcal{O}'_\ell}$ is a strict refinement of $\equiv_{\mathcal{O}_\ell}$.


Proposition 4. For all $u, v \in R \cup R\Sigma$, we have $u \equiv_{\mathcal{O}'_\ell} v \Rightarrow u \equiv_{\mathcal{O}_\ell} v$.
Furthermore, the index of $\equiv_{\mathcal{O}'_\ell}$ is strictly greater than the index of $\equiv_{\mathcal{O}_\ell}$.


Since the number of classes of $\equiv$ up to counter limit $\ell$ is bounded by the
width $K$ and the $\ell + 1$ levels of $BG_\ell(\mathcal{A})$, by Propositions 2 to 4, we deduce
that after a finite number of steps, we obtain an observation table $\mathcal{O}_\ell$ and its
corresponding DFA $\mathcal{A}_{\mathcal{O}_\ell}$ such that $\mathcal{L}(\mathcal{A}_{\mathcal{O}_\ell}) = L_\ell$.


Algorithm 1 Learning an ROCA

Require: A teacher knowing an ROCA $\mathcal{A}$
Ensure: An ROCA accepting the same language is returned

Initialize the observation table $\mathcal{O}_\ell$ with $\ell = 0$, $R = S = \widehat{S} = \{\varepsilon\}$
while true do
    Make $\mathcal{O}_\ell$ closed, $\Sigma$-, and $\bot$-consistent
    Construct the DFA $\mathcal{A}_{\mathcal{O}_\ell}$ from $\mathcal{O}_\ell$
    Ask a partial equivalence query over $\mathcal{A}_{\mathcal{O}_\ell}$
    if the answer is negative then
        Update $\mathcal{O}_\ell$ with the provided counterexample        ▷ $\ell$ is not modified
    else
        Identify all periodic descriptions $\alpha_1, \ldots, \alpha_n$ of $\mathcal{A}_{\mathcal{O}_\ell}$
        Construct an ROCA $\mathcal{A}_{\alpha_i}$ for each $\alpha_i$
        Ask an equivalence query over each $\mathcal{A}_{\alpha_i}$
        if the answer is true for an $\mathcal{A}_{\alpha_i}$ then return $\mathcal{A}_{\alpha_i}$
        else Select one counterexample and update $\mathcal{O}_\ell$      ▷ $\ell$ is increased


Learning Algorithm. We have every piece needed to give the learning
algorithm for ROCAs, as presented in Algorithm 1. We initialize the observation
table $\mathcal{O}_\ell$ with $\ell = 0$, $R = S = \widehat{S} = \{\varepsilon\}$. Then, we make the table closed, $\Sigma$-, and
$\bot$-consistent, construct the DFA $\mathcal{A}_{\mathcal{O}_\ell}$, and ask for a partial equivalence query
with $\mathcal{A}_{\mathcal{O}_\ell}$. If the teacher answers positively, we have learned a DFA accepting
$L_\ell$. Otherwise, we use the provided counterexample to update the table
without increasing $\ell$. Once the learned DFA $\mathcal{A}_{\mathcal{O}_\ell}$ accepts the language $L_\ell$, the next
proposition states that the initial fragments (up to a certain counter limit) of
both $\mathcal{A}_{\mathcal{O}_\ell}$ and $BG(\mathcal{A})$ are isomorphic. This means that, once we have learned
a long enough initial fragment, we can extract a periodic description from $\mathcal{A}_{\mathcal{O}_\ell}$
that is valid for $BG(\mathcal{A})$.


Proposition 5. Let $BG(\mathcal{A})$ be the behavior graph of an ROCA $\mathcal{A}$, $K$ be its
width, and $m$, $k$ be the offset and the period of a periodic description of $BG(\mathcal{A})$.
Let $s = m + (K \cdot k)^4$. Let $\mathcal{O}_\ell$ be a closed, $\Sigma$- and $\bot$-consistent observation table
up to $\ell > s$ such that $\mathcal{L}(\mathcal{A}_{\mathcal{O}_\ell}) = L_\ell$. Then, the trim parts of the subautomata of
$BG(\mathcal{A})$ and $\mathcal{A}_{\mathcal{O}_\ell}$ restricted to the levels in $\{0, \ldots, \ell - s\}$ are isomorphic.




Hence we extract all possible periodic descriptions $\alpha$ from $\mathcal{A}_{\mathcal{O}_\ell}$. By
Proposition 1, each description $\alpha$ yields an ROCA $\mathcal{A}_\alpha$ on which we ask for an equivalence
query. If the teacher answers positively, we have learned an ROCA accepting $L$
and we are done. Otherwise, we need to increase the counter limit and update
the table using some of the counterexamples provided by the teacher.

Extracting every possible periodic description of $\mathcal{A}_{\mathcal{O}_\ell}$ can be performed by
identifying an isomorphism between two consecutive subgraphs of $\mathcal{A}_{\mathcal{O}_\ell}$. That is,
we fix values for the offset $m$ and period $k$ and see if the subgraphs induced by
the levels $m$ to $m + k - 1$, and by the levels $m + k$ to $m + 2k - 1$ are isomorphic
(this means considering all pairs $(m, k)$ such that $m + 2k - 1 \leq \ell$). This can be
done by executing two depth-first searches in parallel [27]. Note that multiple
periodic descriptions may be found, due to the finite knowledge of the learner.
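The enumeration of candidate pairs (m, k) can be sketched as follows in Python. This is a deliberate simplification and not the procedure of [27]: we assume that the levels have already been enumerated in an aligned way, so that the isomorphism test between two blocks of levels reduces to comparing the stored mappings for equality, whereas the actual algorithm searches for an isomorphism with two parallel depth-first searches. The function name is ours, and offsets and periods start at 1, following Theorem 1 as printed.

```python
def candidate_periods(taus):
    """Return all pairs (m, k) with m + 2k - 1 <= l such that the block of
    levels m .. m+k-1 equals the block m+k .. m+2k-1, where
    taus = [tau_0, ..., tau_l] is the learned fragment (l = len(taus) - 1)."""
    limit = len(taus) - 1
    found = []
    for m in range(1, limit + 1):
        for k in range(1, limit + 1):
            if m + 2 * k - 1 > limit:
                break  # larger periods no longer fit below the counter limit
            if all(taus[m + i] == taus[m + k + i] for i in range(k)):
                found.append((m, k))
    return found
```

For a fragment whose levels alternate after the first one, such as `["x", "p", "q", "p", "q", "p"]` (strings used as placeholders for the mappings), both `(1, 2)` and `(2, 2)` are reported, which illustrates why several descriptions may have to be tried.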

In case none of the ROCAs constructed from $\mathcal{A}_{\mathcal{O}_\ell}$ accepts $L$, we handle the
counterexamples returned by the teacher as follows. If among them, there is one
counterexample, say $w$, such that the height $h_{\mathcal{A}}(w)$ exceeds $\ell$, we add $\mathrm{Pref}(w)$
to $R$ (as in the case of a negative partial equivalence query) and the new counter
limit is updated to $h_{\mathcal{A}}(w)$. If none of the counterexamples has a height exceeding
$\ell$ (this may happen due to the limited knowledge of the learner), we instead
use $\mathcal{A}_{\mathcal{O}_\ell}$ directly as an ROCA and ask an equivalence query. Since $\mathcal{L}(\mathcal{A}_{\mathcal{O}_\ell}) = L_\ell$
(as the last partial equivalence query was true), the counterexample returned by
the teacher necessarily has a high enough height and we proceed as above.


Complexity of the Algorithm. Let us briefly explain the complexity
announced in Theorem 2 for Algorithm 1 in terms of $|Q|$, the number of states of
the given ROCA, and $t$, the length of the longest counterexample returned by
the teacher. The given bound on the number of (partial) equivalence queries is
obtained by arguments similar to those of [27]. The number of steps in the main
loop of Algorithm 1 is the (polynomial) number of partial equivalence queries.
During one step in this loop, by carefully studying how we make the table closed
and consistent and handle a counterexample, we get that $R \cup R\Sigma$ (resp. $\widehat{S}$) grows
linearly (resp. exponentially) in $|Q|$, $|\Sigma|$, and $t$. We also get an exponential number
of membership and counter value queries for the whole algorithm.


4 Experiments 


We evaluated our algorithm on two types of benchmarks. The first uses randomly 
generated ROCAs, while the second focuses on a new approach to learn an 
ROCA that can efficiently check if a JSON document is valid against a given 
JSON schema. Notice that while there exist several algorithms that infer a JSON 
schema from a collection of JSON documents (see survey [5]), none are based on 
learning techniques nor do they yield an automaton-based validation algorithm. 

The ROCAs and the learning algorithm were implemented by extending the
well-known Java libraries AutomataLib and LearnLib [20]. These modifications
can be consulted on [31,33], while the code for the benchmarks is available
on [32]. Implementation-specific details (such as the libraries) are given alongside
the code. The server used for the computations ran Debian 10 over Linux
5.4.73-1-pve with a 4-core Intel(R) Xeon(R) Silver 4214R Processor with 16.5M
cache, and 64GB of RAM. Moreover, we used OpenJDK version 11.0.12.


4.1 Random ROCAs 


We first discuss our benchmarks based on randomly generated ROCAs. 


Random Generation of ROCAs. An ROCA with given size $n = |Q|$ is 
randomly generated such that (1) $\forall q \in Q$, $q$ has a probability 0.5 of being final, 
and (2) $\forall q \in Q, \forall a \in \Sigma$, $\delta_{>0}(q, a) = (p, c)$ with $p$ a random state in $Q$ and $c$ 
a random counter operation in $\{-1, 0, +1\}$. We define $\delta_{=0}(q, a) = (p, c)$ in a 
similar way except that $c \in \{0, +1\}$. All random draws are assumed to come 
from a uniform distribution. Since this generation does not guarantee an ROCA 
with n reachable states, we generate 100 ROCAs and select the ROCA with a 
maximal number of reachable states. However, it is still possible the resulting 
ROCA does not have n (co)-reachable states. 
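
The generation procedure can be sketched as follows. This Python sketch is only illustrative and is not the Java implementation used for the benchmarks: the `Roca` container, the convention that state 0 is initial, and the bounded exploration used to estimate the reachable states (counter values capped at `bound`) are assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Roca:
    """Hypothetical container for an ROCA; state 0 is assumed to be initial."""
    n: int             # number of states, numbered 0..n-1
    alphabet: tuple    # input symbols
    final: set         # final states
    delta_zero: dict   # (state, symbol) -> (state, op), used when the counter is 0
    delta_pos: dict    # (state, symbol) -> (state, op), used when the counter is > 0


def random_roca(n, alphabet, rng):
    """Draw one ROCA following rules (1) and (2), with uniform draws."""
    final = {q for q in range(n) if rng.random() < 0.5}
    pairs = [(q, a) for q in range(n) for a in alphabet]
    delta_pos = {k: (rng.randrange(n), rng.choice((-1, 0, 1))) for k in pairs}
    delta_zero = {k: (rng.randrange(n), rng.choice((0, 1))) for k in pairs}
    return Roca(n, tuple(alphabet), final, delta_zero, delta_pos)


def reachable_states(roca, bound):
    """States seen when exploring configurations whose counter stays <= bound."""
    seen = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        q, c = stack.pop()
        table = roca.delta_zero if c == 0 else roca.delta_pos
        for a in roca.alphabet:
            p, op = table[(q, a)]
            nxt = (p, c + op)
            if nxt[1] <= bound and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return {q for q, _ in seen}


def best_of(n, alphabet, rng, tries=100):
    """Generate `tries` ROCAs and keep one with the most reachable states."""
    candidates = (random_roca(n, alphabet, rng) for _ in range(tries))
    return max(candidates, key=lambda r: len(reachable_states(r, bound=n * n)))
```

Because the exploration is bounded, `reachable_states` may under-approximate the true set of reachable states; the choice `bound=n * n` is a heuristic of this sketch, not a value taken from the paper.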


Equivalence of Two ROCAs. The language equivalence problem of ROCAs 
is known to be decidable and NL-complete [8]. Unfortunately, the algorithm described 
in [8] is difficult to implement. Instead, we use an “approximate” equivalence 
oracle for our experiments.³ Let A and B be two ROCAs such that B is the 
learned ROCA from a periodic description with period k. The algorithm explores 
the configuration space of both ROCAs in parallel. If, at some point, it reaches 
a pair of configurations such that one is accepting and the other not, then we 
have a counterexample. However, to have an algorithm that eventually stops, we 
need to bound the counter value of the configurations to explore. Our approach 
is to first explore up to counter value $|A \times B|^2$ (in view of [8, Proposition 18] 
about shortest accepting runs in an ROCA). If no counterexample is found, we 
add k to the bound and, with probability 0.5, a new exploration is done up to 
the new bound. We repeat this whole process until we find a counterexample or 
until the random draw forces us to stop. 
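
A minimal Python sketch of such an oracle is given below. It is an illustration under stated assumptions rather than the implementation used in the experiments: the `Roca` container is hypothetical, state 0 is taken as initial, acceptance is assumed to require a final state with counter value 0, and $|A \times B|$ is taken to be the product of the numbers of states.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Roca:
    """Hypothetical ROCA container; state 0 is assumed to be initial."""
    n: int
    alphabet: tuple
    final: set
    delta_zero: dict   # (state, symbol) -> (state, op) when the counter is 0
    delta_pos: dict    # (state, symbol) -> (state, op) when the counter is > 0


def step(roca, config, symbol):
    q, c = config
    p, op = (roca.delta_zero if c == 0 else roca.delta_pos)[(q, symbol)]
    return (p, c + op)


def accepts(roca, config):
    # Assumption: acceptance requires a final state and counter value 0.
    q, c = config
    return q in roca.final and c == 0


def find_counterexample(a, b, bound):
    """Breadth-first search over pairs of configurations with counters <= bound."""
    start = ((0, 0), (0, 0))
    seen = {start}
    queue = deque([(start, ())])
    while queue:
        (ca, cb), word = queue.popleft()
        if accepts(a, ca) != accepts(b, cb):
            return word
        for symbol in a.alphabet:
            nxt = (step(a, ca, symbol), step(b, cb, symbol))
            if max(nxt[0][1], nxt[1][1]) <= bound and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + (symbol,)))
    return None


def approximate_equivalence(a, b, period, rng):
    """Return a counterexample word, or None if none was found before stopping."""
    bound = (a.n * b.n) ** 2
    while True:
        word = find_counterexample(a, b, bound)
        if word is not None:
            return word
        bound += period
        if rng.random() >= 0.5:   # with probability 0.5, stop exploring
            return None
```

A `None` answer only means that no counterexample was found within the explored bounds, which is why the oracle is approximate.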


Results. For our random benchmarks, we let the size $|Q|$ of the ROCA vary 
between one and five, and the size $|\Sigma|$ of the alphabet between one and four. 
For each pair $(|Q|, |\Sigma|)$ of sizes, we execute the learning algorithm on 100 ROCAs 
(generated as explained above). We set a timeout of 20 minutes and a memory 
limit of 16GB. The number of executions with a timeout is given in Table 1 (we 
do not give the pairs $(|Q|, |\Sigma|)$ where every execution could finish). 

The mean of the total time taken by the algorithm is given in Figure 5a. One 
can see that it grows exponentially in both sizes $|Q|$ and $|\Sigma|$. Note that 


3 The teacher might, with some small probability, answer with false positives but never 
with false negatives. 

Table 1: Number (over 100) of executions with a timeout (TO). The executions 
for the missing pairs $(|Q|, |\Sigma|)$ could all finish. 


$|Q|$   $|\Sigma|$   TO (20 min) 
[table rows lost in extraction] 


executions with a timeout had their execution time set to 20 minutes, in order 
to highlight the curve. Let us now drop all the executions with a timeout. The 
mean length of the longest counterexample provided by the teacher for (partial) 
equivalence queries is presented in Figure 5b and the final size of the sets R and 
S is presented in Figures 5c and 5d. Note that the curves go down due to the 
limited number of remaining executions (for instance, the ones that could finish 
did not require long counterexamples). We can see that S grows larger than R, 
which is coherent with the theoretical results stated at the end of Section 3. 


[Figure 5 plots; panel titles:] 
(a) Mean of the total time taken by the learning algorithm. 
(b) Mean of the length t of the longest counterexample. 
(c) Mean of the final size of R. 
(d) Mean of the final size of S. 


Fig. 5: Results for the benchmarks based on random ROCAs. 


4.2 JSON Documents and JSON Schemas 


Let us now discuss the second set of benchmarks, which constitutes a proof-of- 
concept for an efficient validator of JSON-document [11] streams and is inspired 
by [13]. This format is currently the most popular one for exchanging information 
on the web. Constraints over documents can be described by a JSON schema [22] 
(like DTDs do for XML documents). See [22,21] for a brief overview of JSON 
documents and schemas. 

In our learning process, the learner aims to construct an ROCA that can 
validate a JSON document, according to the schema. We assume the teacher 
knows the target schema and the queries are specialized as follows: (1) Membership 
queries: the learner provides a JSON document and the teacher answers 
true if the document is valid for the schema. (2) Counter value queries: the 
learner provides a JSON document and the teacher returns the number of un- 
matched { and [. Adding the two values is a heuristic abstraction that allows us 
to summarize two-counter information into a single counter value. Importantly, 
the abstraction is a design choice regarding our implementation of a teacher 
for these experiments and not an assumption made by our learning algorithm. 
(3) Partial equivalence query: the learner provides a DFA and a counter limit $\ell$. 
The teacher randomly generates an a-priori fixed number of documents with a 
height not exceeding the counter limit $\ell$ and checks whether the DFA and the 
schema both agree on the documents’ validity. If a disagreement is noticed, the 
incorrectly classified document is returned. (4) Equivalence query: the learner 
provides an ROCA. It is very similar to partial equivalence queries, except that 
documents are generated without a bound on the height. Note that the randomness 
of the (partial) equivalence queries implies that the learned ROCA may not 
recognize exactly the same set of documents as the schema. 
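
As an illustration of the counter value query, the following Python sketch computes the number of unmatched `{` and `[` in a document given as a sequence of alphabet symbols. The function name and the symbol-list input format are assumptions made for this example, and the input is assumed to be a prefix of a well-formed document.

```python
OPENING = {"{", "["}
CLOSING = {"}", "]"}


def counter_value(symbols):
    """Number of unmatched '{' and '[' after reading the given symbols."""
    height = 0
    for symbol in symbols:
        if symbol in OPENING:
            height += 1
        elif symbol in CLOSING:
            height -= 1
    return height
```

Both kinds of brackets feed the same counter, which is exactly the heuristic abstraction of two-counter information into a single value described above.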

In order for an ROCA to be learned in a reasonable time, some abstractions 
are made mainly to reduce the alphabet size: (1) If an object has a key named 
key, we consider the sequence of characters "key" as a single alphabet symbol. 
(2) Strings, integers, and numbers are abstracted as "\S", "\I", and "\D" re- 
spectively. Booleans are left unmodified. (3) The symbols ,, {, }, [, ], : are 
all considered as different alphabet symbols. (4) We assume each object is composed 
of an ordered (instead of unordered) collection of key-value pairs. Note 
that the learning algorithm can learn without these restrictions but it requires 
substantially more time, due to a blowup in the state space or in the alphabet. 
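
The four abstractions can be sketched in Python as a function turning a parsed JSON value into a list of alphabet symbols. This is a hedged illustration, not the benchmark code: the function name `abstract` is hypothetical, the handling of `null` is an assumption (the text does not mention it), and key order is simply the order in which the parser returns the keys.

```python
import json


def abstract(value):
    """Map a parsed JSON value to a list of abstract alphabet symbols."""
    if isinstance(value, bool):          # must be tested before int
        return ["true" if value else "false"]
    if isinstance(value, str):
        return ["\\S"]
    if isinstance(value, int):
        return ["\\I"]
    if isinstance(value, float):
        return ["\\D"]
    if value is None:                    # assumption: null is kept as a symbol
        return ["null"]
    if isinstance(value, list):
        symbols = ["["]
        for index, item in enumerate(value):
            if index > 0:
                symbols.append(",")
            symbols.extend(abstract(item))
        return symbols + ["]"]
    if isinstance(value, dict):
        symbols = ["{"]
        for index, (key, item) in enumerate(value.items()):
            if index > 0:
                symbols.append(",")
            symbols.append('"' + key + '"')   # a whole key is one symbol
            symbols.append(":")
            symbols.extend(abstract(item))
        return symbols + ["}"]
    raise TypeError(f"not a JSON value: {value!r}")
```

For instance, `abstract(json.loads('{"n": 3}'))` yields the five symbols `{`, `"n"`, `:`, `\I`, `}`, so the alphabet contains one symbol per key rather than one per character.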

Moreover, notice that the alphabet is not known at the start of the learning 
process (due to the fact that keys can be any strings). Therefore we slightly 
modify the learning algorithm to support growing alphabets. More precisely, the 
learner’s alphabet starts with the symbols { and } (to guarantee we can at least 
produce a syntactically valid JSON document for the first partial equivalence 
query) and is augmented each time a new symbol is seen. 


Results. We considered three JSON schemas. The first is a simple document 
listing all possible values (i.e., it contains an integer, a double, and so on). The 


Learning Realtime One-Counter Automata 259 


Table 2: Results for JSON benchmarks. 


Schema   TO (1h)   Time (s)   $t$     $|R|$    $|S|$   $|A|$   $|\Sigma|$ 
1        0         16.39      31.00   55.55    32.00   33.00   19.00 
2        27        1045.64    12.99   57.84    33.74   44.29   14.70 
3        19        922.19     49.49   171.94   50.49   51.16   9.00 


second is a real-world JSON schema⁴ used by a code coverage tool called Codecov [14]. 
Finally, the third schema encodes a recursive list, i.e., an object containing 
a list with at most one object defined recursively. This last example is 
used to force the behavior graph to be infinite. 

Table 2 gives the results of the benchmarks, obtained by fixing the number 
of random documents per (partial) equivalence query to 1000. For each 
schema, 100 executions were run with a time limit of one hour and a memory 
limit of 16GB per execution. We can see that real-world JSON schemas and 
recursively-defined schemas can both be learned by our approach. One last interesting 
statistic is that $|R|$ is larger than $|S|$, unlike for the random benchmarks. 


5 Future Work 


As future work, we believe one might be able to remove the use of partial equiva- 
lence queries. In this direction, perhaps replacing our use of Neider and Löding’s 
VCA algorithm by Isberner’s TTT algorithm [19] for visibly pushdown automata 
might help. Indeed, the TTT algorithm does not need partial equivalence queries. 

Another interesting direction concerns lowering the (query) complexity of 
our algorithm. In [29], it is proved that the L* algorithm [4] can be modified so that 
adding a single separator after a failed equivalence query is enough to update the 
observation table. This would remove the suffix-closedness requirements on the 
separator sets $S$ and $\hat{S}$. It is not immediately clear to us whether the definition 
of L-consistency presented here holds in that context. Further optimizations, 
such as discrimination tree-based algorithms (see e.g. Kearns and Vazirani’s 
algorithm [24]), also do not need the separator set to be suffix-closed. 

It would also be interesting to directly learn the one-counter language instead 
of an ROCA. Indeed, our algorithm learns some ROCA that accepts the target 
language. It would be desirable to learn some canonical representation of the 
language (e.g. a minimal automaton, for some notion of minimality). 

Finally, as far as we know, there currently is no active learning algorithm for 
deterministic one-counter automata (such that $\varepsilon$-transitions are allowed). We 
want to study how we can adapt our learning algorithm to this context. 


4 We downloaded the schema from the JSON Schema Store [23]. We modified the file 
to remove all constraints of type “enum”. 




V. Bruyère et al. 


References 


1. Aarts, F., Jonsson, B., Uijen, J., Vaandrager, F.W.: Generating models of infinite-state 
communication protocols using regular inference with abstraction. Formal 
Methods Syst. Des. 46(1), 1-41 (2015). https://doi.org/10.1007/s10703-014-0216-x 

2. Abel, A., Reineke, J.: Gray-box learning of serial compositions of mealy machines. 
In: Rayadurgam, S., Tkachuk, O. (eds.) NASA Formal Methods - 
8th International Symposium, NFM 2016, Minneapolis, MN, USA, June 7-9, 
2016, Proceedings. Lecture Notes in Computer Science, vol. 9690, pp. 272-287. 
Springer (2016). https://doi.org/10.1007/978-3-319-40648-0_21 

3. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. 
Addison-Wesley series in computer science / World student series edition, 
Addison-Wesley (1986), https://www.worldcat.org/oclc/12285707 

4. Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 
75(2), 87-106 (1987). https://doi.org/10.1016/0890-5401(87)90052-6 

5. Baazizi, M.A., Colazzo, D., Ghelli, G., Sartiani, C.: Schemas and types 
for JSON data. In: Herschel, M., Galhardas, H., Reinwald, B., Fundulaki, I., 
Binnig, C., Kaoudi, Z. (eds.) Advances in Database Technology 
- 22nd International Conference on Extending Database Technology, EDBT 
2019, Lisbon, Portugal, March 26-29, 2019. pp. 437-439. OpenProceedings.org 
(2019). https://doi.org/10.5441/002/edbt.2019.39 

6. Berman, P., Roos, R.: Learning one-counter languages in polynomial time (extended 
abstract). In: 28th Annual Symposium on Foundations of Computer Science, 
Los Angeles, California, USA, 27-29 October 1987. pp. 61-67. IEEE Computer 
Society (1987). https://doi.org/10.1109/SFCS.1987.36 

7. Berthon, R., Boiret, A., Pérez, G.A., Raskin, J.: Active learning of sequential 
transducers with side information about the domain. In: Moreira, N., Reis, R. (eds.) 
Developments in Language Theory - 25th International Conference, DLT 2021, 
Porto, Portugal, August 16-20, 2021, Proceedings. Lecture Notes in Computer 
Science, vol. 12811, pp. 54-65. Springer (2021). 
https://doi.org/10.1007/978-3-030-81508-0_5 

8. Böhm, S., Göller, S., Jančar, P.: Bisimulation equivalence and regularity for 
real-time one-counter automata. J. Comput. Syst. Sci. 80(4), 720-743 (2014). 
https://doi.org/10.1016/j.jcss.2013.11.003 

9. Bollig, B.: One-counter automata with counter observability. In: Lal, A., 
Akshay, S., Saurabh, S., Sen, S. (eds.) 36th IARCS Annual Conference 
on Foundations of Software Technology and Theoretical Computer 
Science, FSTTCS 2016, December 13-15, 2016, Chennai, India. LIPIcs, 
vol. 65, pp. 20:1-20:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik 
(2016). https://doi.org/10.4230/LIPIcs.FSTTCS.2016.20 

10. Bouajjani, A., Bozga, M., Habermehl, P., Iosif, R., Moro, P., Vojnar, T.: Programs 
with lists are counter automata. Formal Methods Syst. Des. 38(2), 158-192 
(2011). https://doi.org/10.1007/s10703-011-0111-7 


11. Bray, T.: The javascript object notation (JSON) data interchange format. RFC 
8259, 1-16 (2017). https://doi.org/10.17487/RFC8259 

12. Bruyère, V., Pérez, G.A., Staquet, G.: Learning realtime one-counter automata. 
CoRR abs/2110.09434 (2021), https://arxiv.org/abs/2110.09434 

13. Chitic, C., Rosu, D.: On validation of XML streams using finite state machines. 
In: Amer-Yahia, S., Gravano, L. (eds.) Proceedings of the Seventh International 
Workshop on the Web and Databases, WebDB 2004, June 17-18, 2004, Maison de 
la Chimie, Paris, France, Colocated with ACM SIGMOD/PODS 2004. pp. 85-90. 
ACM (2004). https://doi.org/10.1145/1017074.1017096 

14. Codecov, https://about.codecov.io/ 

15. Fahmy, A.F., Roos, R.S.: Efficient learning of real time one-counter automata. 
In: Jantke, K.P., Shinohara, T., Zeugmann, T. (eds.) Algorithmic Learning 
Theory, 6th International Conference, ALT ’95, Fukuoka, Japan, October 18-20, 
1995, Proceedings. Lecture Notes in Computer Science, vol. 997, pp. 25-40. 
Springer (1995). https://doi.org/10.1007/3-540-60454-5_26 

16. Garhewal, B., Vaandrager, F.W., Howar, F., Schrijvers, T., Lenaerts, T., Smits, 
R.: Grey-box learning of register automata. In: Dongol, B., Troubitsyna, E. (eds.) 
Integrated Formal Methods - 16th International Conference, IFM 2020, Lugano, 
Switzerland, November 16-20, 2020, Proceedings. Lecture Notes in Computer Science, 
vol. 12546, pp. 22-40. Springer (2020). 
https://doi.org/10.1007/978-3-030-63461-2_2 

17. Groce, A., Peled, D.A., Yannakakis, M.: Adaptive model checking. Log. J. IGPL 
14(5), 729-744 (2006). https://doi.org/10.1093/jigpal/jzl007 

18. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages and 
Computation, Second Edition. Addison-Wesley (2000) 

19. Isberner, M., Howar, F., Steffen, B.: The TTT algorithm: A redundancy-free approach 
to active automata learning. In: Bonakdarpour, B., Smolka, S.A. (eds.) 
Runtime Verification - 5th International Conference, RV 2014, Toronto, ON, 
Canada, September 22-25, 2014. Proceedings. Lecture Notes in Computer Science, 
vol. 8734, pp. 307-322. Springer (2014). 
https://doi.org/10.1007/978-3-319-11164-3_26 

20. Isberner, M., Howar, F., Steffen, B.: The open-source learnlib - A framework 
for active automata learning. In: Kroening, D., Pasareanu, C.S. (eds.) Computer 
Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, 
USA, July 18-24, 2015, Proceedings, Part I. Lecture Notes in Computer Science, 
vol. 9206, pp. 487-495. Springer (2015). 
https://doi.org/10.1007/978-3-319-21690-4_32 

21. Json.org, https://www.json.org 

22. Json schema, https://json-schema.org 

23. Json schema store, https://www.schemastore.org/json/ 

24. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational 
Learning Theory. MIT Press (1994), 
https://mitpress.mit.edu/books/introduction-computational-learning-theory 

25. Leucker, M., Neider, D.: Learning minimal deterministic automata from inexperienced 
teachers. In: Margaria, T., Steffen, B. (eds.) Leveraging Applications 
of Formal Methods, Verification and Validation. Technologies for Mastering 
Change - 5th International Symposium, ISoLA 2012, Heraklion, Crete, Greece, 
October 15-18, 2012, Proceedings, Part I. Lecture Notes in Computer Science, 
vol. 7609, pp. 524-538. Springer (2012). 
https://doi.org/10.1007/978-3-642-34026-0_39 

26. Michaliszyn, J., Otop, J.: Learning deterministic automata on infinite words. In: 
Giacomo, G.D., Catala, A., Dilkina, B., Milano, M., Barro, S., Bugarín, A., Lang, J. 
(eds.) ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 
September 2020, Santiago de Compostela, Spain 
- Including 10th Conference on Prestigious Applications of Artificial Intelligence 
(PAIS 2020). Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 
2370-2377. IOS Press (2020). https://doi.org/10.3233/FAIA200367 

27. Neider, D., Löding, C.: Learning visibly one-counter automata in polynomial time. 
Tech. Rep. AIB-2010-02, RWTH Aachen (January 2010) 

28. Peled, D.A., Vardi, M.Y., Yannakakis, M.: Black box checking. J. Autom. Lang. 
Comb. 7(2), 225-246 (2002). https://doi.org/10.25596/jalc-2002-225 

29. Rivest, R.L., Schapire, R.E.: Inference of finite automata using homing sequences. 
Inf. Comput. 103(2), 299-347 (1993). https://doi.org/10.1006/inco.1993.1021 

30. Roos, R.S.: Deciding equivalence of deterministic one-counter automata in polynomial 
time with applications to learning (1988) 

31. Staquet, G.: Automatalib fork for rocas, 
https://github.com/DocSkellington/automatalib 

32. Staquet, G.: Code for the benchmarks for roca learning, 
https://github.com/DocSkellington/LStar-ROCA-Benchmarks 

33. Staquet, G.: Learnlib fork for rocas, https://github.com/DocSkellington/Learnlib 

34. Valiant, L.G., Paterson, M.: Deterministic one-counter automata. J. Comput. Syst. 
Sci. 10(3), 340-350 (1975). https://doi.org/10.1016/S0022-0000(75)80005-5 


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter’s Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will need 
to obtain permission directly from the copyright holder. 




Scalable Anytime Algorithms for 
Learning Fragments of Linear Temporal Logic* 


Ritam Raha¹,² (✉), Rajarshi Roy³, Nathanaël Fijalkow²,⁴, and Daniel 
Neider³ 


1 University of Antwerp, Antwerp, Belgium 
ritam.raha@uantwerpen.be 
2 CNRS, LaBRI and Université de Bordeaux, France 
nathanael.fijalkow@labri.fr 
3 Max Planck Institute for Software Systems, Kaiserslautern, Germany 
{rajarshi,neider}@mpi-sws.org 
4 The Alan Turing Institute of data science, United Kingdom 


Abstract. Linear temporal logic (LTL) is a specification language for 
finite sequences (called traces) widely used in program verification, mo- 
tion planning in robotics, process mining, and many other areas. We 
consider the problem of learning formulas in fragments of LTL without 
the U-operator for classifying traces; despite a growing interest of the re- 
search community, existing solutions suffer from two limitations: they do 
not scale beyond small formulas, and they may exhaust computational 
resources without returning any result. We introduce a new algorithm ad- 
dressing both issues: our algorithm is able to construct formulas an order 
of magnitude larger than previous methods, and it is anytime, meaning 
that it in most cases successfully outputs a formula, albeit possibly not 
of minimal size. We evaluate the performance of our algorithm using an 
open source implementation against publicly available benchmarks. 


Keywords: Linear Temporal Logic · Artificial Intelligence · Specification Mining 


1 Introduction 


Linear Temporal Logic (LTL) is a prominent logic for specifying temporal properties 
[20] over infinite traces, and was recently introduced over finite traces [6]. In 
this paper, we consider finite traces but, in a small abuse of notation, call this 
logic LTL as well. It has become a de facto standard in many fields such as model 
checking, program analysis, and motion planning for robotics. Over the past five 
to ten years, learning temporal logics (of which LTL is the core) has become an 
active research area and has been identified as an important goal in artificial intelligence: 
it formalises the difficult task of building explainable models from data. Indeed, 
as we will see in the examples below and as argued in the literature, e.g., by 
[4] and [24], LTL formulas are typically easy to interpret by human users and 
therefore useful as explanations. The variable free syntax of LTL and its natural 
inductive semantics make LTL a natural target for building classifiers separating 
positive from negative traces. 

The fundamental problem we study here, established in [25], is to build an 
explainable model in the form of an LTL formula from a set of positive and neg- 
ative traces. More formally (we refer to the next section for formal definitions), 
given a set $u_1, \dots, u_n$ of positive traces and a set $v_1, \dots, v_n$ of negative traces, 
the goal is to construct a formula $\varphi$ of LTL which satisfies all $u_i$’s and none of 
the $v_i$’s. In that case, we say that $\varphi$ is a separating formula or, using machine 
learning terminology, a classifier. 

To make things concrete let us introduce our running example, a classic mo- 
tion planning problem in robotics and inspired by [15]. A robot collects wastebin 
contents in an office-like environment and empties them in a trash container. Let 
us assume that there is an office o, a hallway h, a container c and a wet area 
w. The following are possible traces obtained in experimentation with the robot 
(for instance, through simulation): 


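$u_1 = h \cdot h \cdot h \cdot h \cdot o \cdot h \cdot c \cdot h$ 
$v_1 = h \cdot h \cdot h \cdot h \cdot h \cdot c \cdot h \cdot o \cdot h \cdot h$ 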


In LTL learning we start from these labelled data: given $u_1$ as positive and $v_1$ 
as negative, what is a possible classifier including $u_1$ but not $v_1$? Informally, $v_1$ 
being negative implies that the order is fixed: $o$ must be visited before $c$. We 
look for classifiers in the form of separating formulas, for instance 


$\mathbf{F}(o \wedge \mathbf{F}\,\mathbf{X}\, c)$, 


where the F-operator stands for “finally” and X for “next”. Note that this 
formula requires the robot to visit the office first and only then visit the container. 
Assume now that two more negative traces were added: 


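$v_2 = h \cdot h \cdot h \cdot h \cdot h \cdot o \cdot w \cdot c \cdot h \cdot h \cdot h$ 
$v_3 = h \cdot h \cdot h \cdot h \cdot h \cdot w \cdot o \cdot w \cdot c \cdot w \cdot w$ 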


Then the previous separating formula is no longer correct, and a possible sepa- 
rating formula is 
$\mathbf{F}(o \wedge \mathbf{F}\,\mathbf{X}\, c) \wedge \mathbf{G}(\neg w)$, 


which additionally requires the robot to never visit the wet area. Here the G- 
operator stands for “globally”. 

Let us emphasise at this point that for the sake of presentation, we con- 
sider only exact classifiers: a separating formula must satisfy all positive traces 
and none of the negative traces. However, our algorithm naturally extends to 
the noisy data setting where the goal is to construct an approximate classifier, 
replacing ‘all’ and ‘none’ by ‘almost all’ and ‘almost none’. 

State of the art. A number of different approaches have been proposed, leveraging 
SAT solvers [19], automata [4], and Bayesian inference [16], and extended 
to more expressive logics such as Property Specification Language (PSL) [24] 
and Computational Tree Logic (CTL) [9]. 

Applications include program specification [17], anomaly and fault detec- 
tion [3], robotics [5], and many more: we refer to [4], Section 7, for a list of 
practical applications. An equivalent point of view on LTL learning is as a specification 
mining question. The ARSENAL [13] and FRET [14] projects construct 
LTL specifications from natural language; we refer to [18] for an overview. 

Existing methods do not scale beyond formulas of small size, making them 
hard to deploy for industrial cases. A second serious limitation is that they often 
exhaust computational resources without returning any result. Indeed theoretical 
studies [11] have shown that constructing the minimal LTL formula is NP-hard 
already for very small fragments of LTL, explaining the difficulties found in 
practice. 


Our approach. To address both issues, we turn to approximation and any- 
time algorithms. Here approximation means that the algorithm does not ensure 
minimality of the constructed formula: it does ensure that the output formula 
separates positive from negative traces, but it may not be the smallest one. On 
the other hand, an algorithm solving an optimisation problem is called anytime 
if it finds better and better solutions the longer it keeps running. In other words, 
anytime algorithms work by refining solutions. As we will see in the experiments, 
this implies that even if our algorithm times out, it may yield some good albeit 
non-optimal formula. 

Our algorithm targets a strict fragment of LTL, which does not contain the 
Until operator (nor its dual Release operator). It combines two ingredients: 


— Searching for directed formulas: we define a space efficient dynamic program- 
ming algorithm for enumerating formulas from a fragment of LTL that we 
call Directed LTL. 

— Combining directed formulas: we construct two algorithms for combining 
formulas using Boolean operators. The first is an off-the-shelf decision tree 
algorithm, and the second is a new greedy algorithm called Boolean subset 
cover. 


The two ingredients yield two subprocedures: the first one finds directed for- 
mulas of increasing size, which are then fed to the second procedure in charge 
of combining them into a separating formula. This yields an anytime algorithm 
as both subprocedures can output separating formulas even with a low compu- 
tational budget and refine them over time. 


Let us illustrate the two subprocedures in our running example. The first 
procedure enumerates so-called directed formulas in increasing size; we refer to 
the corresponding section for a formal definition. The directed formulas 
$\mathbf{F}(o \wedge \mathbf{F}\,\mathbf{X}\, c)$ and $\mathbf{G}(\neg w)$ have small size hence will be generated early on. The second 
procedure constructs formulas as Boolean combinations of directed formulas. 
Without getting into the details of the algorithms, let us note that both 
$\mathbf{F}(o \wedge \mathbf{F}\,\mathbf{X}\, c)$ and $\mathbf{G}(\neg w)$ satisfy $u_1$. The first does not satisfy $v_1$ and the second does 
not satisfy $v_2$ and $v_3$. Hence their conjunction $\mathbf{F}(o \wedge \mathbf{F}\,\mathbf{X}\, c) \wedge \mathbf{G}(\neg w)$ is separating, 
meaning it satisfies $u_1$ but none of $v_1, v_2, v_3$. 


Outline. The mandatory definitions and the problem statement we deal with 
are described in Section 2. Section 3 gives a high-level overview of the main idea 
of the algorithm. The next two sections, Section 4 and Section 5, describe the 
two phases of our algorithm in detail, in one section each. We discuss the theoretical 
guarantees of our algorithm in Section 6. We conclude with an empirical 
evaluation in Section 7. 


2 Preliminaries 


Traces. Let P be a finite set of atomic propositions. An alphabet is a finite 
non-empty set $\Sigma = 2^P$, whose elements are called symbols. A finite trace over $\Sigma$ 
is a finite sequence $t = a_1 a_2 \dots a_n$ such that for every $1 \le i \le n$, $a_i \in \Sigma$. We say 
that $t$ has length $n$ and write $|t| = n$. For example, let $P = \{p, q\}$; in the trace 
$t = \{p, q\} \cdot \{p\} \cdot \{q\}$ both $p$ and $q$ hold at the first position, only $p$ holds in the 
second position, and q in the third position. Note that, throughout the paper, 
we only consider finite traces. 

A trace is a word if exactly one atomic proposition holds at each position: 
we used words in the introduction example for simplicity, writing $h \cdot o \cdot c$ instead 
of $\{h\} \cdot \{o\} \cdot \{c\}$. 

Given a trace $t = a_1 a_2 \dots a_n$ and $1 \le i \le j \le n$, let $t[i, j] = a_i \dots a_j$ be the 
infix of $t$ from position $i$ up to and including position $j$. Moreover, $t[i] = a_i$ is 
the symbol at the $i$-th position. 


Linear Temporal Logic. The syntax of Linear Temporal Logic (LTL, in short) 
is defined by the following grammar 


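$\varphi := p \in P \mid \neg p \mid \varphi \vee \psi \mid \varphi \wedge \psi \mid \mathbf{X}\,\varphi \mid \mathbf{F}\,\varphi \mid \mathbf{G}\,\varphi \mid \varphi\,\mathbf{U}\,\psi$ 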


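We use the standard formulas: $\mathit{true} = p \vee \neg p$, $\mathit{false} = p \wedge \neg p$ and $\mathit{last} = \neg \mathbf{X}\,\mathit{true}$, 
which denotes the last position of the trace. As a shorthand, we use 
$\mathbf{X}^n \varphi$ for $\underbrace{\mathbf{X} \cdots \mathbf{X}}_{n \text{ times}}\, \varphi$. 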
The size of a formula is the size of its underlying syntax tree. 


Formulas in LTL are evaluated over finite traces. To define the semantics of 
LTL we introduce the notation $t, i \models \varphi$, which reads ‘the LTL formula $\varphi$ holds 
over trace $t$ from position $i$’. We say that $t$ satisfies $\varphi$ and we write $t \models \varphi$ when 
$t, 1 \models \varphi$. The definition of $\models$ is inductive on the formula $\varphi$: 


- $t, i \models p \in P$ if $p \in t[i]$. 
- $t, i \models \mathbf{X}\,\varphi$ if $i < |t|$ and $t, i+1 \models \varphi$. It is called the neXt operator. 
- $t, i \models \mathbf{F}\,\varphi$ if $t, i' \models \varphi$ for some $i' \in [i, |t|]$. It is called the eventually operator 
  ($\mathbf{F}$ comes from Finally). 
- $t, i \models \mathbf{G}\,\varphi$ if $t, i' \models \varphi$ for all $i' \in [i, |t|]$. It is called the Globally operator. 
- $t, i \models \varphi\,\mathbf{U}\,\psi$ if $t, j \models \psi$ for some $i \le j \le |t|$ and $t, i' \models \varphi$ for all $i \le i' < j$. It 
  is called the Until operator. 
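
The inductive definition translates directly into a recursive evaluator. The Python sketch below is illustrative only: the encoding of formulas as nested tuples and the function names are assumptions of this example, the clauses for negated propositions, disjunction, and conjunction follow the standard Boolean semantics, and traces are assumed to be non-empty lists of sets with positions starting at 1.

```python
def holds(trace, i, phi):
    """Decide whether phi holds over a non-empty finite trace from position i."""
    op, n = phi[0], len(trace)
    if op == "ap":
        return phi[1] in trace[i - 1]
    if op == "not":                     # negation of an atomic proposition
        return phi[1] not in trace[i - 1]
    if op == "or":
        return holds(trace, i, phi[1]) or holds(trace, i, phi[2])
    if op == "and":
        return holds(trace, i, phi[1]) and holds(trace, i, phi[2])
    if op == "X":
        return i < n and holds(trace, i + 1, phi[1])
    if op == "F":
        return any(holds(trace, j, phi[1]) for j in range(i, n + 1))
    if op == "G":
        return all(holds(trace, j, phi[1]) for j in range(i, n + 1))
    if op == "U":
        return any(
            holds(trace, j, phi[2])
            and all(holds(trace, k, phi[1]) for k in range(i, j))
            for j in range(i, n + 1)
        )
    raise ValueError(f"unknown operator: {op}")


def satisfies(trace, phi):
    """A trace satisfies phi when phi holds from position 1."""
    return holds(trace, 1, phi)


def word(letters):
    """Turn a string such as 'hoc' into a trace with one proposition per position."""
    return [{letter} for letter in letters]


# Formulas of the running example from the introduction.
visit_o_then_c = ("F", ("and", ("ap", "o"), ("F", ("X", ("ap", "c")))))
never_wet = ("G", ("not", "w"))
both = ("and", visit_o_then_c, never_wet)
```

With this encoding, `satisfies(word("hhhhohch"), both)` evaluates to `True` for the positive trace of the introduction, while the three negative traces are rejected.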


The LTL Learning Problem. The LTL exact learning problem studied in this 
paper is the following: given a set P of positive traces and a set N of negative 
traces, construct a minimal LTL separating formula $\varphi$, meaning such that $t \models \varphi$ 
for all $t \in P$ and $t \not\models \varphi$ for all $t \in N$. 

There are two relevant parameters for a sample: its size, which is the number 
of traces, and its length, which is the maximum length of all traces. 

The problem is naturally extended to the LTL noisy learning problem where 
the goal is to construct an $\varepsilon$-separating formula, meaning such that $\varphi$ satisfies 
all but an $\varepsilon$ proportion of the traces in $P$ and none but an $\varepsilon$ proportion of 
the traces in $N$. For the sake of simplicity we present an algorithm for solving 
the LTL exact learning problem, and later sketch how to extend it to the noisy 
setting. 
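
Both variants of the problem amount to the same check with a different tolerance. A minimal Python sketch, assuming a hypothetical predicate `satisfies(t)` that stands for "$t$ satisfies the candidate formula" and reading "all but an $\varepsilon$ proportion" as "at most a fraction $\varepsilon$ is misclassified":

```python
def is_separating(satisfies, positives, negatives, eps=0.0):
    """Check (eps-)separation; eps = 0 gives the exact learning problem."""
    missed = sum(1 for t in positives if not satisfies(t))
    accepted = sum(1 for t in negatives if satisfies(t))
    return missed <= eps * len(positives) and accepted <= eps * len(negatives)
```

For example, with traces written as strings and a predicate that only checks that `w` never occurs, the sample with positives `hoh`, `hch` and negatives `hwh`, `hhh` is not separated exactly, but it is separated for `eps=0.5`.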


3 High-level view of the algorithm 


Let us start with a naive algorithm for the LTL Learning Problem. We can search 
through all LTL formulas in some order and check whether they are separating 
for our sample or not. Checking whether an LTL formula is separating can be 
done using standard methods (e.g., using bit-vector operations [2]). However, 
the major drawback of this idea is that we have to search through all LTL 
formulas, which is hard as the number of LTL formulas grows very quickly.⁵ 
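
To make the naive algorithm concrete, here is a short Python sketch that enumerates formulas by increasing size and returns the first separating one. It is an illustration under assumptions, not the authors' code: formulas are nested tuples, a negated proposition is counted as two nodes, and `satisfies(trace, phi)` is a caller-supplied (hypothetical) evaluator.

```python
def formulas(size, props):
    """Yield every formula whose syntax tree has exactly `size` nodes."""
    if size < 1:
        return
    if size == 1:
        for p in props:
            yield ("ap", p)
        return
    if size == 2:
        for p in props:
            yield ("not", p)          # negation is only applied to propositions
    for sub in formulas(size - 1, props):
        for op in ("X", "F", "G"):
            yield (op, sub)
    for left_size in range(1, size - 1):
        for left in formulas(left_size, props):
            for right in formulas(size - 1 - left_size, props):
                for op in ("and", "or", "U"):
                    yield (op, left, right)


def naive_search(positives, negatives, props, max_size, satisfies):
    """Return the first separating formula of size at most max_size, or None."""
    for size in range(1, max_size + 1):
        for phi in formulas(size, props):
            if all(satisfies(t, phi) for t in positives) and not any(
                satisfies(t, phi) for t in negatives
            ):
                return phi
    return None
```

Even with a single proposition, this enumeration produces 1, 4, and 15 formulas for sizes 1, 2, and 3, which illustrates why exhaustive search quickly becomes impractical.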

To tackle this issue, instead of the entire LTL fragment, our algorithm (as 
outlined in Algorithm 1) performs an iterative search through a fragment of LTL, 
which we call Directed LTL (Line 4). We expand upon this in Section 4. In that 
section, we also describe how we can iteratively generate these Directed LTL 
formulas in a particular “size order” (not the usual size of an LTL formula) and 
evaluate these formulas over the traces in the sample efficiently using dynamic 
programming techniques. 

To include more formulas in our search space, we generate and search through
Boolean combinations of the most promising Directed LTL formulas
(Line 11), which we describe in detail in Section 5. Note that the fragment of
LTL that our algorithm searches through ultimately does not include formulas
with the U operator. Thus, for readability, we use LTL to refer to the fragment
LTL \ U in the rest of the paper.

During the search, at each iteration our algorithm looks for separating
formulas that are smaller than the previously found ones, if any. In fact, as a
heuristic, once a separating formula is found, we only search through formulas
that are smaller than the found separating formula. Such a heuristic, along with
aiding the search for minimal formulas, also reduces the search space
significantly.


5 The number of LTL formulas of size $k$ is asymptotically equivalent to
$\frac{\sqrt{14} \cdot 7^k}{2\sqrt{\pi k^3}}$ [12].


Algorithm 1 Overview of our algorithm

 1: $B \leftarrow \emptyset$
 2: $\psi \leftarrow \emptyset$: best formula found
 3: for all $s$ in "size order" do
 4:     $D \leftarrow$ all Directed LTL formulas of parameter $s$
 5:     for all $\varphi \in D$ do
 6:         if $\varphi$ is separating and smaller than $\psi$ then
 7:             $\psi \leftarrow \varphi$
 8:         end if
 9:     end for
10:     $B \leftarrow B \cup D$
11:     $B \leftarrow$ Boolean combinations of the promising formulas in $B$
12:     for all $\varphi \in B$ do
13:         if $\varphi$ is separating and smaller than $\psi$ then
14:             $\psi \leftarrow \varphi$
15:         end if
16:     end for
17: end for
18: Return $\psi$


Anytime property. The anytime property of our algorithm is also a consequence
of storing the smallest formula seen so far (Lines 7 and 14). Once we find a
separating formula, we can output it and continue the search for smaller
separating formulas.


Extension to the noisy setting. The algorithm is seamlessly extended to the
noisy setting by rewriting lines 6 and 13: instead of outputting only separating
formulas, we output $\varepsilon$-separating formulas.
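The anytime behaviour and the pruning heuristic can be sketched as a Python generator. This is only an illustration under assumptions: `anytime_search`, `candidates`, `is_separating` and `size` are hypothetical names, and the candidate stream stands in for the enumeration of Directed LTL formulas and their Boolean combinations. Passing an $\varepsilon$-separation test as `is_separating` gives the noisy variant.

```python
from typing import Callable, Iterable, Iterator, TypeVar

F = TypeVar("F")


def anytime_search(
    candidates: Iterable[F],
    is_separating: Callable[[F], bool],
    size: Callable[[F], int],
) -> Iterator[F]:
    """Yield each improved separating formula as soon as it is found."""
    best = None
    for phi in candidates:
        # Heuristic from the text: once a separating formula is known,
        # only strictly smaller candidates are worth checking.
        if best is not None and size(phi) >= size(best):
            continue
        if is_separating(phi):
            best = phi
            yield best          # the caller may stop here (anytime property)
```

A caller can consume the generator under a time budget and keep the last value it received, which is the smallest separating formula found so far.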


4 Searching for directed formulas 


The first insight of our approach is the definition of a fragment of LTL that we 
call directed LTL. 

A partial symbol is a conjunction of positive or negative atomic propositions.
We write $s = p_0 \wedge p_2 \wedge \neg p_1$ for the partial symbol specifying that
$p_0$ and $p_2$ hold and $p_1$ does not. The definition of a symbol satisfying a
partial symbol is natural: for instance the symbol $\{p_0, p_2, p_4\}$ satisfies $s$.
The width of a partial symbol is the number of atomic propositions it uses.

Directed LTL is defined by the following grammar:

$$\varphi = X^n s \mid F X^n s \mid X^n (s \wedge \varphi) \mid F X^n (s \wedge \varphi),$$

where $s$ is a partial symbol and $n \in \{0, 1, \dots\}$. As an example, the directed
formula

$$F((p \wedge q) \wedge F X^2 \neg p)$$

reads: there exists a position satisfying $p \wedge q$, and at least two positions later,
there exists a position satisfying $\neg p$. The intuition behind the term "directed"
is that a directed formula fixes the order in which the partial symbols occur.
A non-directed formula is $F p \wedge F q$: there is no order between $p$ and $q$. Note
that Directed LTL only uses the X and F operators as well as conjunctions and
atomic propositions.
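The following Python sketch illustrates partial symbols and the evaluation of a directed formula on a finite trace. The encoding is an assumption made for this example: a partial symbol is a dictionary from propositions to required truth values, and a directed formula is a list of steps `(has_F, n, s)`, each standing for one `X^n s` or `F X^n s` layer of the grammar.

```python
from typing import Dict, List, Set, Tuple

PartialSymbol = Dict[str, bool]            # proposition -> required truth value
Step = Tuple[bool, int, PartialSymbol]     # (has_F, n, s) encodes [F] X^n s


def satisfies(symbol: Set[str], partial: PartialSymbol) -> bool:
    """A symbol satisfies a partial symbol if it agrees on every literal."""
    return all((prop in symbol) == wanted for prop, wanted in partial.items())


def width(partial: PartialSymbol) -> int:
    return len(partial)


def directed_holds(steps: List[Step], trace: List[Set[str]]) -> bool:
    """Check whether the directed formula holds on the trace from position 1."""
    current = {1}                          # positions the next layer starts from
    for has_f, n, partial in steps:
        reached = set()
        for i in current:
            first = i + n                  # X^n moves exactly n positions ahead
            last = len(trace) if has_f else first   # F allows any later start
            for j in range(first, min(last, len(trace)) + 1):
                if satisfies(trace[j - 1], partial):
                    reached.add(j)
        current = reached
    return bool(current)
```

For the example formula of the text, the steps are `[(True, 0, {"p": True, "q": True}), (True, 2, {"p": False})]`. Tracking the set of positions at which the last partial symbol can be matched is the same idea that the dynamic programming tables of this section exploit.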


Generating directed formulas. Let us consider the following problem: given
the sample $S = P \cup N$, we want to generate all directed formulas, each together
with the list of traces in $S$ that satisfy it. Our first technical contribution and
key to the scalability of our approach is an efficient solution to this problem
based on dynamic programming.

Let us define a natural order in which we want to generate directed formulas. 
They have two parameters: length, which is the number of partial symbols in the 
directed formula, and width, which is the maximum of the widths of the partial 
symbols in the directed formula. We consider the order based on summing these 
two parameters: 


(1,1), (2,1), (1, 2), (3, 1), (2, 2), (1, 3),... 


(We note that in practice, slightly more complicated orders on pairs are useful
since we want to increase the length more often than the width.) Our
enumeration algorithm works by generating all directed formulas of a given pair
of parameters in a recursive fashion. Assuming that we already generated all
directed formulas for the pair of parameters $(\ell, w)$, we define two procedures,
one for generating the directed formulas for the parameters $(\ell + 1, w)$, and the
other one for $(\ell, w + 1)$.
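The basic order on (length, width) pairs, sorted by the sum of the two parameters, can be generated as follows. This is a small sketch of the simple order listed above only. The name `size_order` is an assumption, and the more complicated orders used in practice are not modelled.

```python
from itertools import count, islice
from typing import Iterator, Tuple


def size_order() -> Iterator[Tuple[int, int]]:
    """Enumerate (length, width) pairs by increasing length + width."""
    for total in count(2):                 # smallest pair is (1, 1)
        for width in range(1, total):      # for a fixed sum, longer formulas first
            yield (total - width, width)


first_pairs = list(islice(size_order(), 6))
```

Here `first_pairs` contains the six pairs displayed in the text, in the same order.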

When we generate the directed formulas, we also keep track of which traces
in the sample they satisfy by exploiting a dynamic programming table called
LASTPOS. We define it as follows, where $\varphi$ is a directed formula and $t$ a trace
in $S$:

$$\mathrm{LASTPOS}(\varphi, t) = \{ i \in [1, |t|] : t[1, i] \models \varphi \}.$$

The main benefit of LASTPOS is that it meshes well with directed formulas: it
is algorithmically easy to compute these tables recursively on the structure of
directed formulas.

A useful idea is to change the representation of the set of traces $S$, by
precomputing the lookup table INDEX defined as follows, where $t$ is a trace in $S$,
$s$ a partial symbol, and $i \in [1, |t|]$:

$$\mathrm{INDEX}(t, s, i) = \{ j \in [i + 1, |t|] : t[j] \models s \}.$$

The table INDEX can be precomputed in linear time from $S$, and makes the
dynamic programming algorithm easier to formulate.
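One way to realise the INDEX lookup is sketched below. It is an illustration under assumptions, not the authors' data structure: the class name `IndexTable` and the encoding of partial symbols as frozen sets of `(proposition, wanted)` pairs are hypothetical. The matching positions of each partial symbol are computed lazily on first use and cached, instead of being precomputed up front.

```python
from bisect import bisect_right
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

PartialSymbol = FrozenSet[Tuple[str, bool]]   # hashable set of (proposition, wanted)


def satisfies(symbol: Set[str], partial: PartialSymbol) -> bool:
    return all((prop in symbol) == wanted for prop, wanted in partial)


class IndexTable:
    """Answers INDEX(t, s, i) = { j in [i+1, |t|] : t[j] |= s } for one trace t."""

    def __init__(self, trace: Sequence[Set[str]]) -> None:
        self.trace = trace
        self._positions: Dict[PartialSymbol, List[int]] = {}

    def lookup(self, partial: PartialSymbol, i: int) -> List[int]:
        if partial not in self._positions:
            # Sorted list of all 1-indexed positions whose symbol satisfies s.
            self._positions[partial] = [
                j for j, symbol in enumerate(self.trace, start=1)
                if satisfies(symbol, partial)
            ]
        positions = self._positions[partial]
        # Keep only the positions strictly greater than i.
        return positions[bisect_right(positions, i):]
```

Because the cached list is sorted, each query reduces to a binary search followed by a slice.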

Having defined the important ingredients, we now present the pseudocode
(Algorithm 2) for increasing both the length and the width of a formula. For the
length increase algorithm, we define two extension operators $\wedge_{=k}$ and
$\wedge_{\ge k}$ that "extend" the length of a directed formula $\varphi$ by including a
partial symbol $s$ in the formula. Precisely, the operator $s \wedge_{=k} \varphi$ replaces
the rightmost partial symbol $s'$ in $\varphi$ with $(s' \wedge X^k s)$, while
$s \wedge_{\ge k} \varphi$ replaces $s'$ with $(s' \wedge F X^k s)$. For instance,
$c \wedge_{=2} X(a \wedge X b) = X(a \wedge X(b \wedge X^2 c))$. For the width increase
algorithm, we say that two directed formulas are compatible if they are equal
except for partial symbols. For two compatible formulas, we define a
pointwise-and ($\wedge$) operator that takes the conjunction of the corresponding
partial symbols at the same positions. For instance,
$X(a \wedge X b) \wedge X(b \wedge X c) = X((a \wedge b) \wedge X(b \wedge c))$. The
actual implementation of the algorithm refines the algorithms in certain places.
For instance:


- Line 3: instead of considering all partial symbols, we restrict to those
  appearing in at least one positive trace.

- Line 13: some computations for $\varphi_{\ge j'}$ can be made redundant; a finer data
  structure factorises the computations.

- Line 25: using a refined data structure, we only enumerate compatible
  directed formulas.


Lemma 1. Algorithm 2 generates all directed formulas and correctly computes
the tables LASTPOS.


The dual point of view. We use the same algorithm to produce formulas in
a dual fragment to directed LTL, which uses the X and G operators, the last
predicate, as well as disjunctions and atomic propositions. The only difference is
that we swap positive and negative traces in the sample. We obtain a directed
formula from such a sample and apply its negation as shown below:


$$\neg X \varphi = \mathit{last} \vee X \neg \varphi \; ; \quad \neg F \varphi = G \neg \varphi \; ; \quad \neg(\varphi_1 \wedge \varphi_2) = \neg \varphi_1 \vee \neg \varphi_2.$$


5 Boolean combinations of formulas 


As explained in the previous section, we can efficiently generate directed formulas 
and dual directed formulas. Now we explain how to form a Boolean combination 
of these formulas in order to construct separating formulas, as illustrated in the 
introduction. 


Algorithm 2 Generation of directed formulas for the set of traces $S$

 1: procedure SEARCH DIRECTED FORMULAS - LENGTH INCREASE($\ell$, $w$)
 2:     for all directed formulas $\varphi$ of length $\ell$ and width $w$ do
 3:         for all partial symbols $s$ of width at most $w$ do
 4:             for all $t \in S$ do
 5:                 $I = \mathrm{LASTPOS}(\varphi, t)$
 6:                 for all $i \in I$ do
 7:                     $J = \mathrm{INDEX}(t, s, i)$
 8:                     for all $j \in J$ do
 9:                         $\varphi_{=j} \leftarrow s \wedge_{=(j-i)} \varphi$
10:                         add $j$ to $\mathrm{LASTPOS}(\varphi_{=j}, t)$
11:                     end for
12:                     for all $j' \le \max(J)$ do
13:                         $\varphi_{\ge j'} \leftarrow s \wedge_{\ge (j'-i)} \varphi$
14:                         add $J \cap [j', |t|]$ to $\mathrm{LASTPOS}(\varphi_{\ge j'}, t)$
15:                     end for
16:                 end for
17:             end for
18:         end for
19:     end for
20: end procedure
21:
22: procedure SEARCH DIRECTED FORMULAS - WIDTH INCREASE($\ell$, $w$)
23:     for all directed formulas $\varphi$ of length $\ell$ and width $w$ do
24:         for all directed formulas $\varphi'$ of length $\ell$ and width 1 do
25:             if $\varphi$ and $\varphi'$ are compatible then
26:                 $\varphi'' \leftarrow \varphi \wedge \varphi'$
27:                 for all $t \in S$ do
28:                     $\mathrm{LASTPOS}(\varphi'', t) \leftarrow \mathrm{LASTPOS}(\varphi, t) \cap \mathrm{LASTPOS}(\varphi', t)$
29:                 end for
30:             end if
31:         end for
32:     end for
33: end procedure


Boolean combination of formulas. Let us consider the following subproblem:
given a set of formulas, does there exist a Boolean combination of some of the
formulas that is a separating formula? We call this problem the Boolean subset
cover, which is illustrated in Figure 1. In this example we have three formulas
$\varphi_1$, $\varphi_2$, and $\varphi_3$, each satisfying subsets of $u_1, u_2, u_3, v_1, v_2, v_3$ as
represented in the drawing. Inspecting the three subsets reveals that
$(\varphi_1 \wedge \varphi_2) \vee \varphi_3$ is a separating formula.

Fig. 1: The Boolean subset cover problem: the formulas $\varphi_1$, $\varphi_2$, and $\varphi_3$
satisfy the words encircled in the corresponding area; in this instance
$(\varphi_1 \wedge \varphi_2) \vee \varphi_3$ is a separating formula.


The Boolean subset cover problem is a generalization of the well-known and
extensively studied subset cover problem, where we are given subsets
$S_1, \dots, S_m$ of $[1, n]$, and the goal is to find a subset $I$ of $[1, m]$ such that
$\bigcup_{i \in I} S_i$ covers all of $[1, n]$; such a set $I$ is called a cover. Indeed, it
corresponds to the case where all formulas satisfy none of the negative traces:
in that case, conjunctions are not useful, and we can ignore the negative traces.
The subset cover problem is known to be NP-complete. However, there exists a
polynomial-time $\log(n)$-approximation algorithm called the greedy algorithm: it
is guaranteed to construct a cover that is at most $\log(n)$ times larger than the
minimum cover. This approximation ratio is optimal in the following sense [7]:
there is no polynomial-time $(1 - o(1)) \log(n)$-approximation algorithm for
subset cover unless P = NP. Informally, the greedy algorithm for the subset cover
problem does the following: it iteratively constructs a cover $I$ by sequentially
adding the most 'promising subset' to $I$, which is the subset $S_i$ maximising how
many more elements of $[1, n]$ are covered by adding $i$ to $I$.
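The classic greedy algorithm described informally above can be written in a few lines of Python. The function name `greedy_cover` is an assumption, and the sketch returns 0-based indices into the list of subsets, whereas the text indexes subsets from 1.

```python
from typing import List, Optional, Set


def greedy_cover(n: int, subsets: List[Set[int]]) -> Optional[List[int]]:
    """Greedy cover of {1, ..., n}; returns chosen indices or None if impossible."""
    uncovered = set(range(1, n + 1))
    cover: List[int] = []
    while uncovered:
        # Most promising subset: the one covering the most uncovered elements.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            return None                    # remaining elements cannot be covered
        cover.append(best)
        uncovered -= subsets[best]
    return cover
```

For instance, on the subsets `{1, 2, 3}`, `{2, 4}`, `{3, 4}`, `{4, 5}` of `[1, 5]`, the first subset is chosen first because it covers three new elements, and the last one is chosen next because it covers both remaining elements.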


We introduce an extension of the greedy algorithm to the Boolean subset
cover problem. The first ingredient is a scoring function, which takes into account
both how close the formula is to being separating, and how large it is. We use
the following score:

$$\mathrm{Score}(\varphi) = \frac{\mathrm{Card}(\{t \in P : t \models \varphi\}) + \mathrm{Card}(\{t \in N : t \not\models \varphi\})}{\sqrt{|\varphi|} + 1},$$

where $|\varphi|$ is the size of $\varphi$. The use of $\sqrt{\cdot}$ is empirical: it is used to
mitigate the importance of size over being separating.

The algorithm maintains a set of formulas $B$, which is initially the set of
formulas given as input, and adds new formulas to $B$ until it finds a separating
formula. Let us fix a constant $K$, which in the implementation is set to 5. At
each point in time, the algorithm chooses the $K$ formulas $\varphi_1, \dots, \varphi_K$ with the
highest score in $B$ and constructs all disjunctions and conjunctions of $\varphi_i$ with
formulas in $B$. For each $i$, we keep the disjunction or conjunction with a maximal
score, and add this formula to $B$ if it has a higher score than $\varphi_i$. We repeat this
procedure until we find a separating formula or no formula is added to $B$.
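The greedy procedure for Boolean subset cover can be sketched as follows. This is an illustrative reading of the description above, not the SCARLET implementation. The `Entry` record, which stores only which traces a formula satisfies, the size of a combination (sum of both sizes plus one), and the score denominator `sqrt(size) + 1` are assumptions of this sketch.

```python
from dataclasses import dataclass
from math import sqrt
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Entry:
    text: str                  # printable formula
    size: int                  # size of the formula
    pos: FrozenSet[int]        # indices of positive traces satisfying it
    neg: FrozenSet[int]        # indices of negative traces satisfying it


def combine(a: Entry, b: Entry, op: str) -> Entry:
    if op == "and":
        pos, neg = a.pos & b.pos, a.neg & b.neg
    else:                      # "or"
        pos, neg = a.pos | b.pos, a.neg | b.neg
    return Entry(f"({a.text} {op} {b.text})", a.size + b.size + 1, pos, neg)


def boolean_subset_cover(
    formulas: List[Entry], n_pos: int, n_neg: int, k: int = 5
) -> Optional[Entry]:
    def score(e: Entry) -> float:
        return (len(e.pos) + n_neg - len(e.neg)) / (sqrt(e.size) + 1)

    def separating(e: Entry) -> bool:
        return len(e.pos) == n_pos and not e.neg

    pool = list(formulas)
    while True:
        found = [e for e in pool if separating(e)]
        if found:
            return min(found, key=lambda e: e.size)
        added = False
        for a in sorted(pool, key=score, reverse=True)[:k]:
            # Best conjunction or disjunction of a with any formula in the pool.
            best = max(
                (combine(a, b, op) for b in pool for op in ("and", "or")),
                key=score,
            )
            if score(best) > score(a) and best not in pool:
                pool.append(best)
                added = True
        if not added:
            return None        # no candidate improved on the current pool
```

Working on the sets of satisfied traces, rather than re-evaluating combined formulas on the sample, keeps each combination step cheap. Note that the size penalty in the score means a combination is only kept when it gains enough traces to offset its larger size.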


Another natural approach to the Boolean subset cover problem is to use
decision trees: we use one variable for each trace and one atomic proposition for
each formula to denote whether the trace satisfies the formula. We then construct
a decision tree classifying all traces. We experimented with both approaches and
found that the greedy algorithm is both faster and yields smaller formulas. We do
not report on these experiments because the formulas output using the decision
tree approach are prohibitively large and therefore not useful for explanations.
Let us, however, remark that using decision trees we get a theoretical guarantee
that if there exists a separating formula as a Boolean combination of the
formulas, then the algorithm will find it.


6 Theoretical guarantees 


The following result shows the relevance of our approach using directed LTL and 
Boolean combinations. 


Theorem 1. Every formula of LTL(F, X, $\wedge$, $\vee$) is equivalent to a Boolean
combination of directed formulas. Equivalently, every formula of
LTL(G, X, $\wedge$, $\vee$) is equivalent to a Boolean combination of dual directed formulas.


The proof of Theorem 1 can be found in the extended version of the
paper [21]. To get an intuition, let us consider the formula $F p \wedge F q$, which is not
directed, but equivalent to $F(p \wedge F q) \vee F(q \wedge F p)$. In the second formulation,
there is a disjunction over the possible orderings of $p$ and $q$. The formal proof
generalises this rewriting idea.
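The rewriting in this intuition can be sanity-checked by brute force on short traces. The sketch below is a bounded check, not a proof. The tuple encoding of formulas and the names `holds` and `equivalent` are assumptions of this example, and positions are 0-indexed here for brevity.

```python
from itertools import product

# All symbols over the two propositions p and q.
SYMBOLS = [frozenset(), frozenset({"p"}), frozenset({"q"}), frozenset({"p", "q"})]


def holds(f, t, i=0):
    """Evaluate a formula built from atoms, 'and', 'or' and 'F' at position i."""
    if isinstance(f, str):
        return f in t[i]
    if f[0] == "and":
        return holds(f[1], t, i) and holds(f[2], t, i)
    if f[0] == "or":
        return holds(f[1], t, i) or holds(f[2], t, i)
    if f[0] == "F":
        return any(holds(f[1], t, k) for k in range(i, len(t)))
    raise ValueError(f"unknown operator: {f[0]!r}")


def equivalent(f, g, max_len=4):
    """Compare f and g on every trace of length 1..max_len over SYMBOLS."""
    return all(
        holds(f, t) == holds(g, t)
        for n in range(1, max_len + 1)
        for t in product(SYMBOLS, repeat=n)
    )


undirected = ("and", ("F", "p"), ("F", "q"))
p_then_q = ("F", ("and", "p", ("F", "q")))
q_then_p = ("F", ("and", "q", ("F", "p")))
rewritten = ("or", p_then_q, q_then_p)
```

On all traces up to length 4, `undirected` and `rewritten` agree, while `undirected` and `p_then_q` alone do not: a trace where $q$ occurs strictly before $p$ distinguishes them.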

This implies the following properties for our algorithm: 


- terminating: given a bound on the size of formulas, the algorithm eventually
  generates all formulas of bounded size,

- correctness: if the algorithm outputs a formula, then it is separating,

- completeness: if there exists a separating formula in LTL(F, G, X, $\wedge$, $\vee$) with
  no nesting of F and G, then the algorithm finds a separating formula.


7 Experimental evaluation 


In this section, we answer the following research questions to assess the perfor- 
mance of our LTL learning algorithm. 


RQ1: How effective are we in learning concise LTL formulas from samples? 
RQ2: How much scalability do we achieve through our algorithm? 
RQ3: What do we gain from the anytime property of our algorithm? 


Experimental Setup. To answer the questions above, we have implemented
a prototype of our algorithm in Python 3 in a tool named SCARLET⁶ (SCalable
Anytime algoRithm for LEarning LTL). We run SCARLET on several benchmarks
generated synthetically from LTL formulas used in practice. To answer each
research question precisely, we choose different sets of LTL formulas. We discuss
them in detail in the corresponding sections. Note, however, that we did not
consider any formulas with the U-operator, since SCARLET is not designed to find
such formulas.

To assess the performance of SCARLET, we compare it against two
state-of-the-art tools for learning logic formulas from examples:

1. FLIE⁷, developed by [19], infers minimal LTL formulas using a learning
   algorithm that is based on constraint solving (SAT solving).

2. SYSLITE⁸, developed by [1], originally infers minimal past-time LTL formulas
   using an enumerative algorithm implemented in a tool called CVC4SY [23].
   For our comparisons, we use a version of SYSLITE that we modified (which
   we refer to as SYSLITE_L) to infer LTL formulas rather than past-time LTL
   formulas. Our modifications include changes to the syntactic constraints
   generated by SYSLITE, as well as changing the semantics from past-time LTL
   to ordinary LTL.


6 https://github.com/rajarshi008/Scarlet


To obtain a fair comparison against SCARLET, we disabled the U-operator in both
tools. Allowing the U-operator would only make the tools slower, since they
would have to search through all formulas containing U.

All the experiments are conducted on a single core of a Debian machine 
with Intel Xeon E7-8857 CPU (at 3GHz) using up to 6GB of RAM. We set 
the timeout to be 900s for all experiments. We include scripts to reproduce all 
experimental results in a publicly available artifact [22]. 


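Table 1: Common LTL formulas used in practice


Absence:                  $G(\neg p)$, $G(q \rightarrow G(\neg p))$
Existence:                $F(p)$, $G(\neg p) \vee F(p \wedge F(q))$
Universality:             $G(p)$, $G(q \rightarrow G(p))$
Disjunction of patterns:  $G(\neg p) \vee F(p \wedge F(q)) \vee G(\neg s) \vee F(r \wedge F(s))$,
                          $F(r) \vee F(p) \vee F(q)$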


Sample generation. To provide a comparison among the learning tools, we 
follow the literature [19,24] and use synthetic benchmarks generated from real- 
world LTL formulas. For benchmark generation, earlier works rely on a fairly 
naive generation method. In this method, starting from a formula $\varphi$, a sample
is generated by randomly drawing traces and categorizing them into positive
and negative examples depending on the satisfaction with respect to $\varphi$. This
method, however, often results in samples that can be separated by formulas
much smaller than $\varphi$. Moreover, it often requires a prohibitively large amount of
time to generate samples (for instance, for $G p$, where a randomly drawn trace
almost never satisfies the formula, so positive examples are rare) and, hence,
often does not terminate in a reasonable time.

To alleviate the issues in the existing method, we have designed a novel
generation method for the quick generation of large samples. In our method,
we first convert the starting formula into an equivalent DFA and then extract
accepted and rejected words to obtain a sample of the desired size. We provide
more details on this new generation method in the extended version [21].


7 https://github.com/ivan-gavran/samples2LTL
8 https://github.com/CLC-UIowa/SySLite


Fig. 2: Comparison of SCARLET, FLIE and SYSLITE_L on synthetic benchmarks.
In Figure 2a, all times are in seconds and 'TO' indicates timeouts. The size of
bubbles in the figure indicates the number of samples for each datapoint.


7.1 RQ1: Performance Comparison 


To address our first research question, we have compared all three tools on a 
synthetic benchmark suite generated from eight LTL formulas. These formulas 
originate from a study by Dwyer et al. [8], who have collected a comprehensive 
set of LTL formulas arising in real-world applications (see Table 1 for an excerpt). 
The selected LTL formulas have, in fact, also been used by FLIE for generating
its benchmarks. While FLIE also considered formulas with the U-operator, we did
not consider them for generating our benchmarks, for the reasons mentioned in
the experimental setup.

Our benchmark suite consists of a total of 256 samples (32 for each of the 
eight LTL formulas) generated using our generation method. The number of 
traces in the samples ranges from 50 to 2000, while the length of traces ranges 
from 8 to 15. 

Figure 2a presents the runtime comparison of FLIE, SYSLITE_L and SCARLET
on all 256 samples. From the scatter plots, we observe that SCARLET ran faster
than FLIE on all samples. Likewise, SCARLET was faster than SYSLITE_L on all
but eight (out of 256) samples. SCARLET timed out on only 13 samples, while
FLIE and SYSLITE_L timed out on 85 and 36, respectively (see Figure 2b).

The good performance of SCARLET can be attributed to its efficient formula
search technique. In particular, SCARLET only considers formulas that have a high
potential of being a separating formula since it extracts Directed LTL formulas
from the sample itself. FLIE and SYSLITE_L, on the other hand, search through
arbitrary formulas (in order of increasing size), each time checking if the current
one separates the sample.

Figure 2c presents the comparison of the size of the formulas inferred by
each tool. On 170 out of the 256 samples, all tools terminated and returned an
LTL formula with size at most 7. On 150 of these 170 samples, SCARLET, FLIE,
and SYSLITE_L inferred formulas of equal size, while on the remaining 20 samples
SCARLET inferred formulas that were larger. The latter observation indicates that
SCARLET misses certain small separating formulas, in particular the ones that
are not a Boolean combination of directed formulas.

However, it is important to highlight that the formulas learned by SCARLET
are in most cases not significantly larger than those learned by FLIE and
SYSLITE_L. This can be seen from the fact that the average size of formulas
inferred by SCARLET (on benchmarks in which none of the tools timed out) is 3.21,
while the average size of formulas inferred by FLIE and SYSLITE_L is 3.07.

Overall, SCARLET displayed a significant speed-up over both FLIE and SYSLITE_L
while learning a formula similar in size, answering question RQ1 in the positive.


7.2 RQ2: Scalability 


To address the second research question, we investigate the scalability of SCARLET 
in two dimensions: the size of the sample and the size of the formula from which 
the samples are generated. 


Scalability with respect to the size of the samples. For demonstrating
the scalability with respect to the size of the samples, we consider two formulas
$\varphi_{cov} = F(a_1) \wedge F(a_2) \wedge F(a_3)$ and $\varphi_{seq} = F(a_1 \wedge F(a_2 \wedge F a_3))$,
both of which appear commonly in robotic motion planning [10]. While the formula
$\varphi_{cov}$ describes the property that a robot eventually visits (or covers) three
regions $a_1$, $a_2$, and $a_3$ in arbitrary order, the formula $\varphi_{seq}$ describes that
the robot has to visit the regions in the specific order $a_1 a_2 a_3$.

We have generated two sets of benchmarks for both formulas for which we 
varied the number of traces and their length, respectively. More precisely, the 
first benchmark set contains 90 samples of an increasing number of traces (5 
samples for each number), ranging from 200 to 100 000, each consisting of traces 
of fixed length 10. On the other hand, the second benchmark set contains 90 
samples of 200 traces, containing traces from length 10 to length 50. As the 
results on both benchmark sets are similar, we here discuss the results on the 
first set and refer the readers to the extended version [21] for the second set. 

Figure 3a shows the average runtime results of SCARLET, FLIE, and SYSLITE_L
on the first benchmark set. We observe that SCARLET substantially outperformed
the other two tools on all samples. This is because both $\varphi_{cov}$ and $\varphi_{seq}$ are
of size eight and inferring formulas of such size is computationally challenging
for FLIE and SYSLITE_L. In particular, FLIE and SYSLITE_L need to search through
all formulas of size up to eight to infer the formulas, while SCARLET, due to its
efficient search order (using the length and width of a formula), infers them
faster.


(b) Scalability in formula size


Fig. 3: Comparison of SCARLET, FLIE and SYSLITE_L on synthetic benchmarks.
In Figure 3a, all times are in seconds and 'TO' indicates timeouts.

From Figure 3a, we further observe a significant difference between the run
times of SCARLET on samples generated from formula $\varphi_{cov}$ and from formula
$\varphi_{seq}$. This is evident from the fact that SCARLET failed to infer formulas for
samples of $\varphi_{seq}$ starting at a size of 6000, while it could infer formulas for
samples of $\varphi_{cov}$ up to a size of 50000. Such a result is again due to the search
order used by SCARLET: while $\varphi_{cov}$ is a Boolean combination of directed formulas
of length 1 and width 1, $\varphi_{seq}$ is a directed formula of length 3 and width 1.


Scalability with respect to the size of the formula. To demonstrate the
scalability with respect to the size of the formula used to generate samples, we
have extended $\varphi_{cov}$ and $\varphi_{seq}$ to families of formulas
$(\varphi^n_{cov})_{n \in \mathbb{N} \setminus \{0\}}$ with
$\varphi^n_{cov} = F(a_1) \wedge F(a_2) \wedge \dots \wedge F(a_n)$ and
$(\varphi^n_{seq})_{n \in \mathbb{N} \setminus \{0\}}$ with
$\varphi^n_{seq} = F(a_1 \wedge F(a_2 \wedge F(\dots \wedge F a_n)))$, respectively. These
families of formulas describe properties similar to those of $\varphi_{cov}$ and
$\varphi_{seq}$, but the number of regions is parameterized by $n \in \mathbb{N} \setminus \{0\}$. We
consider formulas from the two families by varying $n$ from 2 to 5 to generate a
benchmark suite consisting of samples (5 samples for each formula) having 200
traces of length 10.

Figure 3b shows the average runtime comparison of the tools for samples
from increasing formula sizes. We observe a trend similar to Figure 3a: SCARLET
performs better than the other two tools and infers formulas of the family
$\varphi^n_{cov}$ faster than those of $\varphi^n_{seq}$. However, contrary to the near-linear
increase of the runtime with the number of traces, we notice an almost
exponential increase of the runtime with the formula size.

Overall, our experiments show better scalability with respect to sample and 
formula size compared against the other tools, answering RQ2 in the positive. 


7.3 RQ3: Anytime Property 


To answer RQ3, we list two advantages of the anytime property of our algorithm. 
We demonstrate these advantages by showing evidence from the runs of SCARLET 
on benchmarks used in RQ1 and RQ2. 

First, in the event of a timeout, our algorithm may find a "concise"
separating formula while the other tools will not. In our experiments, we observed
that for all benchmarks used in RQ1 and RQ2, SCARLET obtained a formula even
when it timed out. In fact, on the samples from $\varphi^5_{cov}$ used in RQ2, SCARLET
(see Figure 3b) obtained the exact original formula within one second (0.7
seconds on average), although it timed out later. The timeout occurred because
SCARLET continued to search for smaller formulas even after finding the original
formula.

Second, our algorithm can actually output the final formula earlier than its 
termination. This is evident from the fact that, for the 243 samples in RQ1 where 
SCARLET does not time out, the average time required to find the final formula 
is 10.8 seconds, while the average termination time is 25.17 seconds. Thus, there 
is a chance that even if one stops the algorithm earlier than its termination, one 
can still obtain the final formula. 

Our observations from the experiments clearly indicate the advantages of
the anytime property for obtaining a concise separating formula, thus answering
RQ3 in the positive.


8 Conclusion 


We have proposed a new approach for learning temporal properties from
examples, fleshing it out in an approximation anytime algorithm. We have shown
in experiments that our algorithm outperforms existing tools in two ways: it
scales to larger formulas and input samples, and even when it times out it often
outputs a separating formula.

Our algorithm targets a strict fragment of LTL, restricting its expressivity 
in two aspects: it does not include the U (“until”) operator, and we cannot nest 
the eventually and globally operators. We leave for future work to extend our 
algorithm to full LTL. 

An important open question concerns the theoretical guarantees offered by
the greedy algorithm for the Boolean subset cover problem. It extends a
well-known algorithm for the classic subset cover problem, and for this
restriction it has been proved to yield an optimal $\log(n)$-approximation. Do we
have similar guarantees in our more general setting?

Abstract. We introduce a similarity function on formulae of signal temporal
logic (STL). It comes in the form of a kernel function, well known
in machine learning as a conceptually and computationally efficient tool. 
The corresponding kernel trick allows us to circumvent the complicated 
process of feature extraction, i.e. the (typically manual) effort to identify 
the decisive properties of formulae so that learning can be applied. We 
demonstrate this consequence and its advantages on the task of predict- 
ing (quantitative) satisfaction of STL formulae on stochastic processes: 
Using our kernel and the kernel trick, we learn (i) computationally effi- 
ciently (ii) a practically precise predictor of satisfaction, (iii) avoiding the 
difficult task of finding a way to explicitly turn formulae into vectors of 
numbers in a sensible way. We back the high precision we have achieved 
in the experiments by a theoretically sound PAC guarantee, ensuring our 
procedure efficiently delivers a close-to-optimal predictor. 


1 Introduction 


Is it possible to predict the probability that a system satisfies a property without
knowing or executing the system, solely based on previous experience with the
system behaviour w.r.t. some other properties? More precisely, let $P_M[\varphi]$ denote
the probability that a (linear-time) property $\varphi$ holds on a run of a stochastic
process $M$. Is it possible to predict $P_M[\varphi]$ knowing only $P_M[\psi_i]$ for properties
$\psi_1, \dots, \psi_k$, which were randomly chosen (a-priori, not knowing $\varphi$) and thus do
not necessarily have any logical relationship, e.g. implication, to $\varphi$?

While this question cannot be in general answered with complete reliability,
we show that in the setting of signal temporal logic, under very mild assumptions,
it can be answered with high accuracy and low computational costs.

Probabilistic verification and its limits. Stochastic processes form a natural
way of capturing systems whose future behaviour is determined at each moment
by a unique (but possibly unknown) probability measure over the successor
states. The vast range of applications includes not only engineered systems such
as software with probabilistic instructions or cyber-physical systems with failures
but also naturally occurring systems such as biological systems. In all these cases,
predictions of the system behaviour may be required even in cases where the
system is not (fully) known or is too large. For example, consider a
safety-critical cyber-physical system with a third-party component, or a complex
signalling pathway to be understood and medically exploited.


Probabilistic model checking, e.g. [4], provides a wide repertoire of analysis
techniques, in particular to determine the probability $P_M[\varphi]$ that the system
$M$ satisfies the logical formula $\varphi$. However, there are two caveats. Firstly,
despite recent advances [12], the scalability is still quite limited, compared to e.g.
hardware or software verification. Moreover, this is still the case even if we only
require approximate answers, i.e., for a given precision $\varepsilon$, to determine $v$ such
that $P_M[\varphi] \in [v - \varepsilon, v + \varepsilon]$. Secondly, knowledge of the model $M$ is
required to perform the analysis.


Statistical model checking [33] fights these two issues at an often acceptable 
cost of relaxing the guarantee to probably approximately correct (PAC), requiring 
that the approximate answer of the analysis may be incorrect with probability at
most $\delta$. This allows for a statistical evaluation: Instead of analyzing the model,
we evaluate the satisfaction of the given formula on a number of observed runs 
of the system and derive a statistical prediction, which is valid only with some 
confidence. Nevertheless, although M may be unknown, it is still necessary to 
execute the system in order to obtain its runs. 
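As an illustration of such a PAC-style statistical evaluation, the sketch below estimates a satisfaction probability from sampled runs. It uses the standard Hoeffding bound: with $n \ge \ln(2/\delta) / (2\varepsilon^2)$ independent runs, the empirical frequency deviates from the true probability by more than $\varepsilon$ with probability at most $\delta$. The function names and the `sample_run` and `prop` callbacks are assumptions of this example, and this particular bound is one common choice rather than the method of any specific tool.

```python
import math
import random
from typing import Callable, TypeVar

Run = TypeVar("Run")


def required_runs(eps: float, delta: float) -> int:
    """Number of runs given by the Hoeffding bound for precision eps, confidence 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_probability(
    sample_run: Callable[[random.Random], Run],
    prop: Callable[[Run], bool],
    eps: float,
    delta: float,
    rng: random.Random,
) -> float:
    """Estimate the probability that a run of the system satisfies prop."""
    n = required_runs(eps, delta)
    hits = sum(1 for _ in range(n) if prop(sample_run(rng)))
    return hits / n
```

For example, `required_runs(0.1, 0.05)` gives the number of runs needed for precision 0.1 at confidence 95%. The number of runs does not depend on the size of the system, but every run still requires executing it, which is the limitation noted above.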


“Learning” model checking is a new paradigm we propose, in order to fill in
a hole in the model-checking landscape where very little access to the system
is possible. We are given a set of input-output pairs for model checking, i.e.,
a collection $\{(\psi_i, p_i)\}_i$ of formulae and their satisfaction values on a given model
$M$, where $p_i$ can be the probability $P_M[\psi_i]$ of satisfying $\psi_i$, or its robustness
(in case of real-valued logics), or any other quantity. From the data, we learn a
predictor for the model checking problem: a classifier for Boolean satisfaction,
or a regressor for quantitative domains of $p_i$. Note that apart from the results
on the a-priori given formulae, no knowledge of the system is required; also, no
runs are generated and none have to be known. As an example consequence, a
user can investigate properties of a system even before buying it, solely based
on the producer's guarantees on the standardized formulae $\psi_i$.


Advantages of our approach can be highlighted as follows, not intending to
replace standard model checking in standard situations but focusing on the case
of extremely limited (i) information and (ii) online resources. Probabilistic model
checking re-analyzes the system for every new property on the input; statistical
model checking can generate runs and then, for every new property, analyzes
these runs; learning model checking performs one analysis with complexity
dependent only on the size of the data set (a-priori formulae) and then, for every
new formula on input, only evaluates a simple function (whose size is again
independent of the system and the property, and depends only on the data set
size). Consequently, it has the least access to information and the least
computational demands. Although a lack of guarantees is typical for
machine-learning techniques and, in this context with the lowest resources
required, to be expected, we nevertheless provide PAC guarantees.


Technique and our approach. To this end, we show how to efficiently learn
on the space of temporal formulae via the so-called kernel trick, e.g. [32]. This in
turn requires introducing a mapping of formulae to vectors (in a Hilbert space)
that preserves the information on the formulae. How can a formula be transformed
into a vector of numbers (of always the same length)? While this is not clear
at all for finite vectors, we take the dual perspective on formulae, namely as
functionals mapping trajectories to values. This point of view provides us with
a large bag of functional analysis tools [11] and allows us to define the needed
semantic similarity of two formulae (the inner product on the Hilbert space).


Application examples. Having discussed the possibility of learning model
checking, we note that the main potential of our kernel (and generally of introducing
kernels for any further temporal logics) is that it opens the door to efficient learning on
formulae via kernel-based machine-learning techniques [27,31]. Let us sketch a
few further applications that immediately suggest themselves:

Game-based synthesis. Synthesis with temporal-logic specifications can often
be solved via games on graphs [25,19]. However, exploration of the game
graph and finding a winning strategy is done by graph algorithms ignoring
the logical information. For instance, choosing between $a$ and $\neg a$ is tried
out blindly even for specifications that require us to visit $a$s. Approaches
such as [21] demonstrate how to tackle this but hit the barrier of inefficient
learning of formulae. Our kernel will allow for learning reasonable choices
from previously solved games.

Translating, sanitizing and simplifying specifications. A formal specification
given by engineers might be somewhat different from their actual intention.
Using the kernel, we can, for instance, find the closest simple formula
to their inadequate translation from English to logic, which is then likely
to match better. (Moreover, the translation would be easier to automate by
natural language processing since learning from previous cases is easy once
the kernel gives us an efficient representation for learning on formulae.)

Requirement mining. A topic which has received a lot of attention recently is that
of identifying specifications from observed data, i.e., tightly characterizing a
set of observed behaviours or anomalies [7]. Typical methods use either
formula templates [6] or methods based, e.g., on decision trees [9] or genetic
algorithms [28]. Our kernel opens a different strategy to tackle this problem:
lifting the search problem from the discrete combinatorial space of syntactic
structures of formulae to a continuous space in which distances preserve
semantic similarity (using e.g. kernel PCA [27] to build finite-dimensional
embeddings of formulae into $\mathbb{R}^m$).

Our main contributions are the following:

— From the technical perspective, we define a kernel function for temporal 
formulae (of signal temporal logic, see below) and design an efficient way 
to learn it. This includes several non-standard design choices, improving the 
quality of the predictor (see Conclusions). 

— Thereby we open the door to various learning-based approaches for analysis
and synthesis and to further applications, in particular to what we call
learning model checking.

— We demonstrate the efficiency practically on predicting the expected satis- 
faction of formulae on stochastic systems. We complement the experimental 
results with a theoretical analysis and provide a PAC bound. 


1.1 Related Work 


Signal temporal logic (STL) [24] is gaining momentum as a requirement 
specification language for complex systems and, in particular, cyber-physical sys- 
tems [7]. STL has been applied in several flavours, from runtime-monitoring [7], 
falsification problems [17] to control synthesis [18], and recently also within learn- 
ing algorithms, trying to find a maximally discriminating formula between sets 
of trajectories [9,6]. In these applications, a central role is played by the real- 
valued quantitative semantics [15], measuring robustness of satisfaction. Most of 
the applications of STL have been directed to deterministic (hybrid) systems, 
with less emphasis on non-deterministic or stochastic ones [5]. 


Metrics and distances form another area in which formal methods are pro- 
viding interesting tools, in particular logic-based distances between models, like 
bisimulation metrics for Markov models [2,3,1], which are typically based on a 
branching logic. In fact, extending these ideas to linear time logic is hard [14], 
and typically requires statistical approximations. Finally, another relevant prob- 
lem is how to measure the distance between two logic formulae, thus giving a 
metric structure to the formula space, a task relevant for learning which received 
little attention for STL, with the notable exception of [23]. 


Kernels make it possible to work in a feature space of a higher dimension
without increasing the computational cost. Feature space, as used in machine
learning [31,13], refers to an $n$-dimensional real space that is the co-domain
of a mapping from the original space of data. The idea is to map the original
space into a new one that is easier to work with. The so-called kernel trick, e.g. [32],
allows us to efficiently perform approximation and learning tasks over the feature
space without explicitly constructing it. We provide the necessary background
information in Section 2.2.


Overview of the paper: Section 2 recalls STL and the classic kernel trick.
Section 3 provides an overview of our technique and results. Section 4 then discusses
all the technical development in detail. In Section 5, we experimentally evaluate
the accuracy of our learning method. In Section 6, we conclude with future work.

Let $\mathbb{R}, \mathbb{R}_{\ge 0}, \mathbb{Q}, \mathbb{N}$ denote the sets of real, non-negative real, rational, and (positive)
natural numbers, respectively. For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ (with $n \in \mathbb{N}$), we write
$\mathbf{x} = (x_1, \dots, x_n)$ to access the components of the vectors, in contrast to sequences
of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots \in \mathbb{R}^n$. Further, we write $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^n x_i y_i$ for the scalar
product of vectors.


2.1 Signal Temporal Logic 


Signal Temporal Logic (STL) [24] is a linear-time temporal logic suitable
for monitoring properties of trajectories. A trajectory is a function $\xi : I \to D$ with a
time domain $I \subseteq \mathbb{R}_{\ge 0}$ and a state space $D \subseteq \mathbb{R}^n$ for some $n \in \mathbb{N}$. We define the
trajectory space $\mathcal{T}$ as the set of all possible continuous functions⁵ over $D$. An
atomic predicate of STL is a continuous computable predicate⁶ on $\mathbf{x} \in \mathbb{R}^n$ of the
form $f(x_1, \dots, x_n) \ge 0$, typically linear, i.e., $\sum_{i=1}^n q_i x_i \ge 0$ for $q_1, \dots, q_n \in \mathbb{Q}$.


Syntax. The set $\mathcal{P}$ of STL formulae is given by the following syntax:

$$\varphi := tt \mid \pi \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \varphi_1 \mathbf{U}_{[a,b]} \varphi_2$$

where $tt$ is the Boolean true constant, $\pi$ ranges over atomic predicates, negation
$\neg$ and conjunction $\wedge$ are the standard Boolean connectives, and $\mathbf{U}_{[a,b]}$ is the
until operator, with $a, b \in \mathbb{Q}$ and $a < b$. As customary, we can derive the
disjunction operator $\vee$ by De Morgan’s law, and the eventually (a.k.a. future)
operator $\mathbf{F}_{[a,b]}$ and the always (a.k.a. globally) operator $\mathbf{G}_{[a,b]}$ from
the until operator.


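Semantics. STL can be given not only the classic Boolean notion of satisfaction,
denoted by $s(\varphi, \xi, t) = 1$ if $\xi$ at time $t$ satisfies $\varphi$, and $0$ otherwise, but also a
quantitative one, denoted by $\rho(\varphi, \xi, t)$. This measures the quantitative level of
satisfaction of a formula for a given trajectory, evaluating how “robust” the
satisfaction of $\varphi$ is with respect to perturbations in the signal [15]. The quantitative
semantics is defined recursively as follows:

$$\begin{aligned}
\rho(\pi, \xi, t) &= f_\pi(\xi(t)) \quad \text{for } \pi(x_1, \dots, x_n) = (f_\pi(x_1, \dots, x_n) \ge 0) \\
\rho(\neg\varphi, \xi, t) &= -\rho(\varphi, \xi, t) \\
\rho(\varphi_1 \wedge \varphi_2, \xi, t) &= \min\big(\rho(\varphi_1, \xi, t), \rho(\varphi_2, \xi, t)\big) \\
\rho(\varphi_1 \mathbf{U}_{[a,b]} \varphi_2, \xi, t) &= \max_{t' \in [t+a,\, t+b]} \min\Big(\rho(\varphi_2, \xi, t'), \min_{t'' \in [t,\, t']} \rho(\varphi_1, \xi, t'')\Big)
\end{aligned}$$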
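
The recursive definition above can be read as a small evaluator. The following is a minimal sketch, assuming a trajectory sampled on a uniform time grid (so that the time bounds $a, b$ are integer step counts) and a hypothetical tuple encoding of formulae; the encoding, the names `robustness` and `eventually`, and the truncation of time windows at the end of the signal are illustrative assumptions and not part of the paper's implementation.

```python
import numpy as np

# Hypothetical encoding of formulae (an assumption of this sketch):
#   ("pred", f)                   atomic predicate f(x) >= 0, f maps a state vector to a number
#   ("not", phi)
#   ("and", phi1, phi2)
#   ("until", a, b, phi1, phi2)   phi1 U_[a,b] phi2 with integer step bounds a <= b

def robustness(phi, xi, t):
    """Robustness of phi on trajectory xi (array of shape [T, n]) at step t."""
    kind = phi[0]
    if kind == "pred":
        return float(phi[1](xi[t]))
    if kind == "not":
        return -robustness(phi[1], xi, t)
    if kind == "and":
        return min(robustness(phi[1], xi, t), robustness(phi[2], xi, t))
    if kind == "until":
        _, a, b, phi1, phi2 = phi
        last = len(xi) - 1
        best = -np.inf
        # max over t' in [t+a, t+b]; the window is truncated at the end of the signal
        for t1 in range(min(t + a, last), min(t + b, last) + 1):
            # min over t'' in [t, t'] of the robustness of phi1
            hold = min(robustness(phi1, xi, t2) for t2 in range(t, t1 + 1))
            best = max(best, min(robustness(phi2, xi, t1), hold))
        return best
    raise ValueError(f"unknown operator: {kind}")

def eventually(a, b, phi):
    """F_[a,b] phi derived as tt U_[a,b] phi; tt is encoded with robustness +inf."""
    true = ("pred", lambda x: np.inf)
    return ("until", a, b, true, phi)

# Usage: the signal 0, 1, 2, 3 and the predicate x_1 - 2 >= 0.
xi = np.array([[0.0], [1.0], [2.0], [3.0]])
p = ("pred", lambda x: x[0] - 2.0)
rho_now = robustness(p, xi, 0)                         # value of x_1 - 2 at step 0
rho_eventually = robustness(eventually(0, 3, p), xi, 0)
```

The derived operator shows how eventually and always reduce to until, so only the four cases of the definition need to be implemented.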


Soundness and Completeness. Robustness is compatible with satisfaction in
that it complies with the following soundness property: if $\rho(\varphi, \xi, t) > 0$ then
$s(\varphi, \xi, t) = 1$; and if $\rho(\varphi, \xi, t) < 0$ then $s(\varphi, \xi, t) = 0$. If the robustness is 0, both
satisfaction and the opposite may happen, but either way only non-robustly:
there are arbitrarily small perturbations of the signal so that the satisfaction
changes⁷. In fact, it also complies with a completeness property that $\rho$ measures
how robust the satisfaction of a trajectory is with respect to perturbations;
see [15] for more detail.


5 The whole framework can be easily relaxed to piecewise continuous cadlag trajectories
endowed with the Skorokhod topology and metric [8].

6 Results are easily generalizable to predicates defined by piecewise continuous cadlag
functions.


A stochastic process in this context is a probability space $M = (\mathcal{T}, \mathcal{A}, \mu)$, where
$\mathcal{T}$ is a trajectory space and $\mu$ is a probability measure on a $\sigma$-algebra $\mathcal{A}$ over
$\mathcal{T}$. Note that the definition is essentially equivalent to the standard definition of
a stochastic process as a collection $\{D_t\}_{t \in I}$ of random variables, where $D_t(\xi) \in D$
is the signal $\xi(t)$ at time $t$ on $\xi$ [8]. The only difference is that we require, for
simplicity⁸, the signal to be continuous.


Expected robustness and satisfaction probability. Given a stochastic process
$M = (\mathcal{T}, \mathcal{A}, \mu)$, we define the expected robustness $R_M : \mathcal{P} \times I \to \mathbb{R}$ as

$$R_M(\varphi, t) := \mathbb{E}_M[\rho(\varphi, \xi, t)] = \int_{\xi \in \mathcal{T}} \rho(\varphi, \xi, t)\, d\mu(\xi).$$

The qualitative counterpart of the expected robustness is the satisfaction
probability $S(\varphi)$, i.e., the probability that a trajectory generated by the stochastic
process $M$ satisfies the formula $\varphi$:
$S_M(\varphi, t) := \mathbb{E}_M[s(\varphi, \xi, t)] = \int_{\xi \in \mathcal{T}} s(\varphi, \xi, t)\, d\mu(\xi)$.⁹
Finally, when $t = 0$ we often drop the parameter $t$ from all these functions.
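
Both quantities are expectations, so they can be estimated from finitely many sampled trajectories. The following is a minimal sketch, assuming the robustness values of one formula at time 0 on a set of sampled trajectories are already available as an array; the function name and the toy numbers are illustrative only.

```python
import numpy as np

def estimate_expected_robustness_and_satisfaction(rho_values):
    """Monte Carlo estimates of the expected robustness and the satisfaction probability."""
    rho_values = np.asarray(rho_values, dtype=float)
    expected_robustness = rho_values.mean()
    # Boolean satisfaction is approximated by the sign of the robustness;
    # robustness exactly 0 is counted as "not satisfied" (an arbitrary choice here).
    satisfaction_probability = (rho_values > 0).mean()
    return expected_robustness, satisfaction_probability

# Made-up robustness values of a single formula on four trajectories.
rho = [0.5, -0.25, 1.0, 0.75]
R_hat, S_hat = estimate_expected_robustness_and_satisfaction(rho)
```

Using the sign of the robustness as a proxy for satisfaction is justified whenever robustness exactly zero occurs with probability zero.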


2.2 Kernel Crash Course 
We recall the needed background for readers less familiar with machine learning. 


Learning linear models. Linear predictors take the form of a vector of weights,
intuitively giving positive and negative importance to features. A predictor
given by a vector $\mathbf{w} = (w_1, \dots, w_d)$ evaluates a data point $\mathbf{x} = (x_1, \dots, x_d)$
to $w_1 x_1 + \dots + w_d x_d = \langle \mathbf{w}, \mathbf{x} \rangle$. To use it as a classifier, we can, for instance, take
the sign of the result and output yes iff it is positive; to use it as a regressor, we
can simply output the value. During learning, we are trying to separate, respectively
approximate, the training data $\mathbf{x}_1, \dots, \mathbf{x}_N$ with a linear predictor, which
corresponds to solving an optimization problem of the form ($f$ is a suitable loss)

$$\arg\min_{\mathbf{w} \in \mathbb{R}^d} f\big(\langle \mathbf{w}, \mathbf{x}_1 \rangle, \dots, \langle \mathbf{w}, \mathbf{x}_N \rangle, \langle \mathbf{w}, \mathbf{w} \rangle\big)$$

where the possible, additional last term comes from regularization (preference
of simpler weights, with lots of zeros in $\mathbf{w}$).


7 The satisfaction of subformulae changes and, provided the predicates are “indepen- 
dent” of each other, the satisfaction of the whole formula, too. 

8 Again, this assumption can be relaxed since continuous functions are dense in the 
Skorokhod space of cadlag functions. 

9 As argued above, this is essentially equivalent to integrating the indicator function
of robustness being positive since a formula has robustness exactly zero only with
probability zero as we sample all values from continuous distributions.

Need for a feature map $\Phi : \mathit{Input} \to \mathbb{R}^n$. In order to learn, the input object
first needs to be transformed to a vector of numbers. For instance, consider
learning the logical exclusive-or function (summation in $\mathbb{Z}_2$) $y = x_1 \oplus x_2$. Seeing
true as 1 and false as 0 already transforms the input into elements of $\mathbb{R}^2$. However,
observe that there is no linear function separating the sets of points $\{(0,1), (1,0)\}$
(where xor returns true) and $\{(0,0), (1,1)\}$ (where xor returns false). In order to
facilitate learning by linear classifiers, a richer feature space may be needed than
what comes directly with the data. In our example, we can design a feature map
to a higher-dimensional space using $\Phi : (x_1, x_2) \mapsto (x_1, x_2, x_1 \cdot x_2)$. Then, e.g.,
$x_3 \le \frac{x_1 + x_2 - 1}{2}$ holds in the new space iff $x_1 \oplus x_2$, and we can learn this linear
classifier.
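
The xor example can be checked mechanically. The following is a minimal sketch, assuming the separating condition $x_3 \le (x_1 + x_2 - 1)/2$ in the feature space, rewritten as the linear inequality $x_1 + x_2 - 2x_3 - 1 \ge 0$; the names `feature_map` and `predict_xor` are illustrative.

```python
import numpy as np

def feature_map(x1, x2):
    """Phi: (x1, x2) -> (x1, x2, x1 * x2)."""
    return np.array([x1, x2, x1 * x2], dtype=float)

# Linear classifier in the feature space: predict "xor is true" iff <w, Phi(x)> + b >= 0.
# This is x3 <= (x1 + x2 - 1) / 2 multiplied by 2 and rearranged.
w = np.array([1.0, 1.0, -2.0])
b = -1.0

def predict_xor(x1, x2):
    return bool(w @ feature_map(x1, x2) + b >= 0)

# Evaluate the classifier on all four Boolean inputs.
table = {(x1, x2): predict_xor(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

No such weight vector exists for the two original coordinates alone, which is why the product feature is needed.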

Another example can be seen in Fig. 1. The inner circle around zero
cannot be linearly separated from the outer ring. However, considering
$x_3 := x_1^2 + x_2^2$ as an additional feature turns them into easily separable lower
and higher parts of a paraboloid.

Fig. 1. An example of feature maps in linear classification [20].

In both examples, a feature map $\Phi$ mapping the input to a space with
higher dimension ($\mathbb{R}^3$) was used. Nevertheless, two issues arise:


1. What should be the features? Where do we get good candidates? 
2. How to make learning efficient if there are too many features? 


On the one hand, identifying the right features is hard, so we want to consider 
as many as possible. On the other hand, their number increases the dimension 
and thus decreases the efficiency both computationally and w.r.t. the number of 
samples required. 


Kernel trick. Fortunately, there is a way to consider a huge number of features,
but with efficiency independent of their number (and dependent only on the
amount of training data)! This is called the kernel trick. It relies on two properties
of linear classifiers:
— The optimization problem above, after the feature map is applied, takes the
form
$$\arg\min_{\mathbf{w} \in \mathbb{R}^n} f\big(\langle \mathbf{w}, \Phi(\mathbf{x}_1) \rangle, \dots, \langle \mathbf{w}, \Phi(\mathbf{x}_N) \rangle, \langle \mathbf{w}, \mathbf{w} \rangle\big)$$


— Representer theorem: The optimum of the above can be written in the form
$$\mathbf{w}^* = \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i)$$
Intuitively, anything orthogonal to the training data cannot improve the precision
of the classification on the training data, and only increases $\|\mathbf{w}\|$, which we
try to minimize (regularization).

Consequently, plugging the latter form into the former optimization problem
yields an optimization problem of the form ($g$ is a suitable loss derived from $f$):

$$\arg\min_{\alpha \in \mathbb{R}^N} g\big(\alpha, \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle_{1 \le i,j \le N}\big)$$

In other words, we optimize the weights $\alpha$ of expressions where data only appear in
the form $\langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$. Therefore, we can take all features in $\Phi(\mathbf{x}_i)$ into account
if, at the same time, we can efficiently evaluate the kernel function

$$k : (\mathbf{x}, \mathbf{y}) \mapsto \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle$$

i.e., without explicitly constructing $\Phi(\mathbf{x})$ and $\Phi(\mathbf{y})$. Then we can efficiently learn
the predictor on the rich set of features. Finally, when the predictor is applied
to a new point $\mathbf{x}$, we only need to evaluate the expression

$$\langle \mathbf{w}, \Phi(\mathbf{x}) \rangle = \sum_{i=1}^{N} \alpha_i \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}) \rangle = \sum_{i=1}^{N} \alpha_i k(\mathbf{x}_i, \mathbf{x})$$
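
To make the kernel trick concrete, the following is a minimal sketch of kernel ridge regression written directly in terms of a Gram matrix: the weights $\alpha$ have one entry per training point, and prediction only evaluates $\sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$. The Gaussian (RBF) kernel on real vectors, the regularization value, and the toy target $\sin$ are assumptions chosen for illustration; they are not the STL kernel of this paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(K, y, lam=1e-3):
    """Solve (K + lam * I) alpha = y; alpha has one entry per training point."""
    return np.linalg.solve(K + lam * np.eye(len(K)), y)

def predict(alpha, K_new_train):
    """<w, Phi(x)> = sum_i alpha_i * k(x_i, x), with K_new_train[j, i] = k(x_new_j, x_i)."""
    return K_new_train @ alpha

# Toy regression: learn y = sin(x) from 30 equally spaced points.
X_train = np.linspace(-3.0, 3.0, 30).reshape(-1, 1)
y_train = np.sin(X_train[:, 0])
alpha = fit_kernel_ridge(rbf_kernel(X_train, X_train), y_train)

X_test = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
y_pred = predict(alpha, rbf_kernel(X_test, X_train))
```

The feature map of the RBF kernel is infinite-dimensional, yet the code never constructs it: training and prediction touch only kernel values.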


3 Overview of Our Approach and Results 


In this section, we describe what our tasks are if we want to apply the kernel 
trick in the setting of temporal formulae, what our solution ideas are, and where 
in the paper they are fully worked out. 


1. Design the kernel function: define a similarity measure for STL formulae and
prove it takes the form $\langle \Phi(\cdot), \Phi(\cdot) \rangle$.

(a) Design an embedding of formulae into a Hilbert space (vector space with
possibly infinite dimension) ([10], Thm. 3 in App. B proves this is well
defined): Although learning can be applied also to data with complex
structures such as graphs, the underlying techniques typically work on
vectors. How do we turn a formula into a vector?
Instead of looking at the syntax of the formula, we can look at its semantics.
Similarly to Boolean satisfaction, where a formula can be identified
with its language, i.e., the set $\mathcal{T} \to 2 \cong 2^{\mathcal{T}}$ of trajectories that satisfy
it, we can regard an STL formula $\varphi$ as a map $\rho(\varphi, \cdot) : \mathcal{T} \to \mathbb{R} \cong \mathbb{R}^{\mathcal{T}}$
of trajectories to their robustness. Observe that this is a real function,
i.e., an infinite-dimensional vector of reals. Although explicit computations
with such objects are problematic, kernels circumvent the issue. In
summary, we have the implicit features given by the map:

$$\varphi \overset{\rho}{\mapsto} \rho(\varphi, \cdot)$$


(b) Design similarity on the feature representation (in Sec. 4.1): Vectors’
similarity is typically captured by their scalar product $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_i x_i y_i$
since it gets larger whenever the two vectors “agree” on a component.
In complete analogy, we can define for infinite-dimensional vectors (i.e.
functions) $f, g$ their “scalar product” $\langle f, g \rangle = \int f(x) g(x)\, dx$. Hence we
want the kernel to be defined as

$$k(\varphi, \psi) = \langle \rho(\varphi, \cdot), \rho(\psi, \cdot) \rangle = \int_{\xi \in \mathcal{T}} \rho(\varphi, \xi)\, \rho(\psi, \xi)\, d\xi$$


(c) Design a measure on trajectories (Sec. 4.2): Compared to finite-dimensional
vectors, where in the scalar product each component is taken with
equal weight, integrating over uncountably many trajectories requires us
to put a finite measure on them, according to which we integrate. Since,
as a side effect, it necessarily expresses their importance, we define a
probability measure $\mu_0$ preferring “simple” trajectories, where the signals
do not change too dramatically (the so-called total variation is low). This
finally yields the definition of the kernel as¹⁰

$$k(\varphi, \psi) = \int_{\xi \in \mathcal{T}} \rho(\varphi, \xi)\, \rho(\psi, \xi)\, d\mu_0(\xi) \qquad (1)$$


2. Learn the kernel (Sec. 5.1):

(a) Get training data $x_i$: The formulae for training should be chosen
according to the same distribution as the one they come from in the final
prediction task. Since that distribution is unknown, we assume at least
a general preference for simple formulae and thus design a probability
distribution $\mathcal{F}_0$ preferring formulae with simple syntax trees (see
Section 5.1). We also show that several hundred formulae are sufficient for
practically precise predictions.

(b) Compute the “correlation” of the data $\langle \rho(x_i), \rho(x_j) \rangle$ by the kernel $k(x_i, x_j)$:
Now we evaluate (1) for all the data pairs. Since this involves an integral
over all trajectories, we simply approximate it by Monte Carlo: we
choose a number of trajectories according to $\mu_0$ and average the values over
those. In our case, 10000 trajectories provide a very precise approximation.

(c) Optimize the weights $\alpha$ (using values from (b)): Thus we get the most
precise linear classifier given the data, but penalizing too “complicated”
ones since they tend to overfit and not generalize well (so-called
regularization). Recall that the dimension of $\alpha$ is the size of the training data
set, not the infinity of the Hilbert space.

3. Evaluate the predictive power of the kernel and thus implicitly the kernel
function design:

— We evaluate the accuracy of predictions of robustness for single trajectories
(Sec. 5.2), the expected robustness on a stochastic system and
the corresponding Boolean notion of satisfaction probability (Sec. 5.3).
Moreover, we show that there is no need to derive a kernel for each stochastic
process separately depending on their probability spaces, but the one
derived from the generic $\mu_0$ is sufficient and, surprisingly, even more
accurate (Sec. 5.4).

— Besides the experimental evaluation, we provide a PAC bound on our
methods in terms of Rademacher complexity [26] (Sec. 4.4).


10 On the conceptual level; technically, additional normalization and Gaussian
transformation are performed to ensure usual desirable properties, see Cor. 1 in Sec. 4.1.


4 A Kernel for Signal Temporal Logic 


In this section, we sketch the technical details of the construction of the STL 
kernel, of the correctness proof, and of PAC learning bounds. More details on 
the definition, including proofs, are provided in [10], Appendix B. 


4.1 Definition of STL Kernel 


Let us fix a formula $\varphi \in \mathcal{P}$ in the STL formulae space and consider the robustness
$\rho(\varphi, \cdot, \cdot) : \mathcal{T} \times I \to \mathbb{R}$, seen as a real-valued function on the domain $\mathcal{T} \times I$,
where $I \subset \mathbb{R}$ is a bounded interval, and $\mathcal{T}$ is the trajectory space of continuous
functions. The STL kernel is defined as follows.


Definition 1. Fixing a probability measure $\mu_0$ on $\mathcal{T}$, we define the STL-kernel
$$k'(\varphi, \psi) = \int_{\xi \in \mathcal{T}} \int_{t \in I} \rho(\varphi, \xi, t)\, \rho(\psi, \xi, t)\, dt\, d\mu_0$$


The integral is well defined as it corresponds to a scalar product in a suitable
Hilbert space of functions. By formally proving this and leveraging foundational
results on kernel functions [26], we establish the following in [10], Appendix B:


Theorem 1. The function $k'$ is a proper kernel function.


In the previous definition, we can fix time to $t = 0$ and remove the integration
w.r.t. time. This simplified version of the kernel is called untimed, to distinguish
it from the timed one introduced above.
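
For the untimed kernel, the integral over trajectories becomes an average of products of robustness values once trajectories are sampled from the base measure. The following is a minimal sketch, assuming the robustness of every formula on every sampled trajectory at time 0 has already been computed into a matrix; the function name `monte_carlo_kernel` and the toy values are illustrative.

```python
import numpy as np

def monte_carlo_kernel(rho):
    """Monte Carlo estimate of the untimed kernel.

    rho[i, j] is the robustness of formula i on sampled trajectory j at time 0.
    Returns the Gram matrix with entries (1/M) * sum_j rho[i, j] * rho[l, j],
    where M is the number of sampled trajectories.
    """
    rho = np.asarray(rho, dtype=float)
    return rho @ rho.T / rho.shape[1]

# Toy input: 3 formulae on 4 trajectories (made-up robustness values).
# Formulae 0 and 1 agree everywhere; formula 2 is the negation of formula 0.
rho = np.array([[ 1.0, -1.0,  2.0, 0.0],
                [ 1.0, -1.0,  2.0, 0.0],
                [-1.0,  1.0, -2.0, 0.0]])
K_prime = monte_carlo_kernel(rho)
```

Since the estimate is a product of a matrix with its transpose, the resulting Gram matrix is symmetric and positive semi-definite by construction.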


In the rest of the paper, we mostly work with two derived kernels, $k_0$ and $k$:

$$k_0(\varphi, \psi) = \frac{k'(\varphi, \psi)}{\sqrt{k'(\varphi, \varphi)\, k'(\psi, \psi)}} \qquad\qquad k(\varphi, \psi) = \exp\left(-\frac{1 - 2 k_0(\varphi, \psi)}{\sigma^2}\right) \qquad (2)$$

The normalized kernel $k_0$ rescales $k'$ to guarantee that $k_0(\varphi, \varphi) \ge k_0(\varphi, \psi)$ for all
$\varphi, \psi \in \mathcal{P}$. The Gaussian kernel $k$, additionally, allows us to introduce a soft threshold $\sigma^2$
to fine-tune the identification of significantly similar formulae in order to improve
learning. The following proposition is straightforward by virtue of the closure
properties of kernel functions [26]:



Corollary 1. The functions ko and k are proper kernel functions. 
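
Both derived kernels can be computed from a Gram matrix of $k'$ alone. The following is a minimal sketch, assuming the normalization $k_0(\varphi,\psi) = k'(\varphi,\psi)/\sqrt{k'(\varphi,\varphi)\,k'(\psi,\psi)}$ and assuming that the Gaussian transformation has the form $\exp(-(1 - 2k_0)/\sigma^2)$; the latter form in particular is an assumption of this sketch and should be checked against [10]. Function names and the toy matrix are illustrative.

```python
import numpy as np

def normalize_kernel(K_prime):
    """k0(phi, psi) = k'(phi, psi) / sqrt(k'(phi, phi) * k'(psi, psi))."""
    d = np.sqrt(np.diag(K_prime))
    return K_prime / np.outer(d, d)

def gaussian_kernel(K0, sigma):
    """k(phi, psi) = exp(-(1 - 2 * k0(phi, psi)) / sigma^2)  (assumed form)."""
    return np.exp(-(1.0 - 2.0 * K0) / sigma**2)

# Toy Gram matrix of k' for two formulae (made-up values).
K_prime = np.array([[4.0, 2.0],
                    [2.0, 9.0]])
K0 = normalize_kernel(K_prime)
K = gaussian_kernel(K0, sigma=1.0)
```

After normalization every formula has similarity 1 with itself, so the off-diagonal entries become directly comparable across formulae of different robustness scales.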
4.2 The Base Measure $\mu_0$


In order to make our kernel meaningful and not too expensive to compute, we
endow the trajectory space $\mathcal{T}$ with a probability distribution such that more
complex trajectories are less probable. We use the total variation [29] of a
trajectory¹¹ and the number of changes in its monotonicity as indicators of its
“complexity”.

Because later we use the probability measure $\mu_0$ for Monte Carlo approximation
of the kernel $k$, it is advantageous to define $\mu_0$ algorithmically, by providing
a sampling algorithm. The algorithm samples from continuous piecewise linear
functions, a dense subset of $\mathcal{T}$, and is described in detail in [10], Appendix A.
Essentially, we simulate the value of a trajectory at discrete steps $\Delta$, for a total
of $N$ steps (equal to 100 in the experiments), by first sampling its total variation
from a squared Gaussian distribution, and then splitting this total
variation into the single steps, changing the sign of the derivative at each step with
small probability $q$. We then interpolate linearly between consecutive points of
the discretization to make the trajectory continuous piecewise linear.
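
The sampling procedure can be sketched as follows for a one-dimensional signal. This is a minimal sketch of the description above and not the algorithm of [10], Appendix A: the standard normal distributions for the initial value and for the Gaussian that is squared, the uniform splitting of the total variation, and the value `q=0.1` are all assumptions made for illustration.

```python
import numpy as np

def sample_trajectory(n_steps=100, q=0.1, rng=None):
    """Sample the grid values of one piecewise-linear trajectory (n_steps + 1 points)."""
    rng = np.random.default_rng() if rng is None else rng
    start = rng.normal(0.0, 1.0)                  # initial value (assumed standard normal)
    total_variation = rng.normal(0.0, 1.0) ** 2   # squared Gaussian
    # Split the total variation into n_steps non-negative increments.
    cuts = np.sort(rng.uniform(0.0, total_variation, size=n_steps - 1))
    increments = np.diff(np.concatenate(([0.0], cuts, [total_variation])))
    # Sign of the derivative: random initial sign, flipped at each later step with probability q.
    flips = np.where(rng.random(n_steps) < q, -1.0, 1.0)
    flips[0] = rng.choice([-1.0, 1.0])
    signs = np.cumprod(flips)
    values = start + np.concatenate(([0.0], np.cumsum(signs * increments)))
    return values, total_variation

values, tv = sample_trajectory(rng=np.random.default_rng(0))
```

The returned values are the trajectory at the grid points; linear interpolation between them (for instance with `np.interp`) yields the continuous piecewise linear signal.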

In Section 5.4, we show that using this simple measure still allows us to make 
predictions with remarkable accuracy even for other stochastic processes on T. 


4.3 Normalized Robustness 


Consider the predicates $x_1 - 10 \ge 0$ and $x_1 - 10^7 \ge 0$. Given that we train and
evaluate on $\mu_0$, whose trajectories typically take values in the interval $[-3, 3]$ (see
also [10], Appendix A), both predicates are essentially equivalent for satisfiability.
However, their robustness on the same trajectory differs by orders of magnitude.
This very same effect, on a smaller scale, happens also when comparing $x_1 \ge 10$
with $x_1 \ge 20$. In order to ameliorate this issue and make the learning less
sensitive to outliers, we also consider a normalized robustness, where we rescale
the value of the secondary (output) signal to $(-1, 1)$ using a sigmoid function.
More precisely, given an atomic predicate $\pi(x_1, \dots, x_n) = (f_\pi(x_1, \dots, x_n) \ge 0)$, we
define $\hat\rho(\pi, \xi, t) = \tanh(f_\pi(x_1, \dots, x_n))$. The other operators of the logic follow
the same rules as the standard robustness described in Section 2.1. Consequently,
both $x_1 - 10 \ge 0$ and $x_1 - 10^7 \ge 0$ are mapped to very similar robustness for
typical trajectories w.r.t. $\mu_0$, thus reducing the impact of outliers.
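
The effect of the normalization is easy to see numerically. The following is a minimal sketch comparing two predicates with very different thresholds on signal values in $[-3, 3]$; the thresholds 10 and $10^7$ and the sample values are chosen for illustration.

```python
import numpy as np

x1 = np.array([-3.0, 0.0, 3.0])         # typical signal values in [-3, 3]

standard_small = x1 - 10.0               # standard robustness of x1 - 10 >= 0
standard_large = x1 - 1e7                # standard robustness of x1 - 10^7 >= 0
normalized_small = np.tanh(standard_small)
normalized_large = np.tanh(standard_large)

# Largest disagreement between the two predicates under each notion of robustness.
gap_standard = np.max(np.abs(standard_small - standard_large))
gap_normalized = np.max(np.abs(normalized_small - normalized_large))
```

Under the standard robustness the two predicates differ by about $10^7$, whereas after applying $\tanh$ both are within a few millionths of $-1$.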


4.4 PAC Bounds for the STL Kernel 


Probably Approximately Correct (PAC) bounds [26] for learning provide a bound
on the generalization error on unseen data (known as risk) in terms of the training
loss plus additional terms which shrink to zero as the number of samples grows.
These additional terms typically depend also on some measure of the complexity
of the class of models we consider for learning (the so-called hypothesis space),
which ought to be finite. The bound holds with probability $1 - \delta$, where $\delta > 0$
can be set arbitrarily small at the price of the bound getting looser.


11 The total variation of a function $f$ defined on $[a, b]$ is
$V_a^b(f) = \sup_{P \in \mathcal{P}} \sum_{i=0}^{n_P - 1} |f(x_{i+1}) - f(x_i)|$, where
$\mathcal{P} = \{P = \{x_0, \dots, x_{n_P}\} \mid P \text{ is a partition of } [a, b]\}$.

In the following, we state a PAC bound for learning with STL kernels
for classification. A bound for regression, and more details on the classification
bound, can be found in [10], Appendix C. We first recall the definition of the
risk $L$ and the empirical risk $\hat{L}$ for classification. The former is an average of
the zero-one loss over the data generating distribution $p_{data}$, while the latter
averages over a finite sample $D$ of size $m$ of $p_{data}$. Formally,

$$L(h) = \mathbb{E}_{\varphi \sim p_{data}}\big[\mathbb{I}(h(\varphi) \ne y(\varphi))\big] \quad\text{and}\quad \hat{L}_D(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}(h(\varphi_i) \ne y(\varphi_i)),$$

where $y(\varphi)$ is the actual class (truth value) associated with $\varphi$, in contrast to the
predicted class $h(\varphi)$, and $\mathbb{I}$ is the indicator function.

The major issue with PAC bounds for kernels is that we need to constrain in
some way the model complexity. This is achieved by requesting that the functions that
can be learned have a bounded norm. We recall that the norm $\|h\|_{\mathcal{H}}$ of a function
$h$ obtainable by kernel methods, i.e. $h(\varphi) = \sum_{i=1}^{N} \alpha_i k(\varphi_i, \varphi)$, is $\|h\|_{\mathcal{H}} = \alpha^T K \alpha$,
where $K$ is the Gram matrix (kernel evaluated between all pairs of input points,
$K_{ij} = k(\varphi_i, \varphi_j)$). The following theorem, stating the bounds, can be proved by
combining bounds on the Rademacher complexity for kernels with Rademacher
complexity based PAC bounds, as we show in [10], Appendix C.


Theorem 2 (PAC bounds for Kernel Learning in Formula Space). Let
$k$ be a kernel (e.g. normalized, exponential) for STL formulae $\mathcal{P}$, and fix $\Lambda > 0$.
Let $y : \mathcal{P} \to \{-1, 1\}$ be a target function to learn as a classification task. Then
for any $\delta > 0$ and hypothesis function $h$ with $\|h\|_{\mathcal{H}} \le \Lambda$, with probability at least
$1 - \delta$ it holds that


(3) 


The previous theorem gives us a way to control the learning error, provided
we restrict the full hypothesis space. Choosing a value of $\Lambda$ equal to 40 (the
typical value we found in experiments) and confidence 95%, the bound predicts
that around 650000 samples are needed to obtain a risk bounded by the empirical
risk on the training set plus 0.05. This theoretical a-priori bound is much larger than the
training set sizes in the order of hundreds, for which we observe good performance
in practice.


5 Experiments 


We test the performance of the STL kernel in predicting (a) robustness and
satisfaction on single trajectories, and (b) expected robustness and satisfaction
probability estimated statistically from $K$ trajectories. In addition, we test the kernel
on trajectories sampled according to the a-priori base measure $\mu_0$ and according
to the respective stochastic models to check the generalization power of the
generic $\mu_0$-based kernel. Here we report the main results; for additional details
as well as plots and tables for further ways of measuring the error, we refer the
interested reader to [10], Appendix D.

Computation of the STL robustness and of the kernel was implemented in
Python exploiting PyTorch [30] for parallel computation on GPUs. All the
experiments were run on an AMD Ryzen 5000 with 16 GB of RAM and on a consumer
NVIDIA GTX 1660 Ti with 6 GB of GDDR6 RAM. We ran each experiment 1000
times for single trajectories and 500 times for expected robustness and satisfaction
probability, where we use 5000 trajectories for each run. Where not indicated
differently, each result is the mean over all experiments. Computation is
fast: the whole process of sampling from $\mu_0$, computing the kernel, and doing
regression for training, a test set of size 1000, and a validation set of size 200 takes about
10 seconds on GPU. We use the following acronyms: RE = relative error, AE =
absolute error, MRE = mean relative error, MAE = mean absolute error, MSE
= mean square error.


5.1 Setting 

To compute the kernel itself, we sampled 10000 trajectories from $\mu_0$, using the
sampling method described in Section 4.2. As the regression algorithm (for optimizing
$\alpha$ of Sections 2.2 and 3) we use Kernel Ridge Regression (KRR) [27].
KRR was as good as, or superior to, other regression techniques (a comparison
can be found in [10], Appendix D.1).


Training and test sets are composed of $M$ formulae sampled randomly according
to the measure $\mathcal{F}_0$ given by a syntax-tree random recursive growing scheme
(reported in detail in [10], Appendix D.1), where the root is always an operator
node and each node is an atomic predicate with probability $p_{leaf}$ (fixed in these
experiments to 0.5), or, otherwise, another operator node (sampling the type
using a uniform distribution). In these experiments, we fixed $M = 1000$.
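
The recursive growing scheme can be sketched as follows. This is a minimal sketch of the description above and not the algorithm of [10], Appendix D.1: the operator set, the form of the atomic predicates ($x_1 \ge c$ with $c$ drawn from $[-3, 3]$), the omission of time bounds, and the depth cap `max_depth` that guarantees termination are all assumptions made for illustration.

```python
import random

# Assumed operator set and arities (not taken from the paper).
OPERATORS = ["not", "and", "or", "eventually", "globally", "until"]
ARITY = {"not": 1, "and": 2, "or": 2, "eventually": 1, "globally": 1, "until": 2}

def sample_node(p_leaf, rng, depth=0, max_depth=20):
    """Grow a random syntax tree; a non-root node is a leaf with probability p_leaf."""
    is_root = depth == 0
    if depth >= max_depth or (not is_root and rng.random() < p_leaf):
        # Leaf: atomic predicate x_1 >= threshold (threshold range is an assumption).
        return ("pred", round(rng.uniform(-3.0, 3.0), 2))
    op = rng.choice(OPERATORS)  # operator type sampled uniformly
    children = tuple(
        sample_node(p_leaf, rng, depth + 1, max_depth) for _ in range(ARITY[op])
    )
    return (op,) + children

def sample_formula(p_leaf=0.5, seed=None):
    """Sample one formula; the root is always an operator node."""
    return sample_node(p_leaf, random.Random(seed))

def size(tree):
    """Number of nodes in the syntax tree."""
    if tree[0] == "pred":
        return 1
    return 1 + sum(size(child) for child in tree[1:])

formula = sample_formula(p_leaf=0.5, seed=0)
n_nodes = size(formula)
```

Lower values of `p_leaf` produce larger trees on average, which is the knob varied in the formula-complexity experiment.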


Hyperparameters. We vary several hyperparameters, testing their impact on
errors and accuracy. Here we briefly summarize the results.

- The impact of formula complexity: We vary the parameter $p_{leaf}$ in the formula
generating algorithm over the values [0.2, 0.3, 0.4, 0.5] (average formula size around
[100, 25, 10, 6] nodes in the syntax tree), but only a slight increase in the median
relative error is observed for more complex formulae: [0.045, 0.037, 0.031, 0.028].
- The addition of time bounds in the formulae has essentially no impact on the
performance in terms of errors.

- There is a very small improvement (< 10%) when integrating signals w.r.t.
time (timed kernel) vs using only robustness at time zero (untimed kernel), but
at the cost of a 5-fold increase in computational training time.


- Size of training set: The error in estimating robustness decreases as we
increase the amount of training formulae, see Fig. 2. However, already for a few
hundred formulae, the predictions are quite accurate.

Fig. 2. MRE of predicted average robustness vs the size of the training set.

- Exponential kernel $k$ gives a 3-fold improvement in accuracy w.r.t. normalized
kernel $k_0$.

- Dimensionality of signals: Error tends to increase linearly with dimensionality.
For 1000 formulae in the training set, from dimension 1 to 5, MRE is [0.187,
0.248, 0.359, 0.396, 0.488] and MAE is [0.0537, 0.0735, 0.0886, 0.098, 0.112].


5.2 Robustness and Satisfaction on Single Trajectories 


In this experiment, we predict the Boolean satisfiability of a formula using as 
a discriminator the sign of the robustness. We generate the training and test 
sets of formulae using $\mathcal{F}_0$, and sample trajectories from $\mu_0$ with 
dimension n = 1, 2, 3, using a sample independent of the one used for evaluating 
the kernel. We evaluate the standard robustness $\rho$ and the normalized one $\hat{\rho}$ of 
each trajectory for each formula in the training and test sets. We then predict $\rho$ 
and $\hat{\rho}$ for the test set and check if the sign of the predicted robustness agrees with 
that of the true one, which is a proxy for satisfiability, as discussed previously. 
Accuracy and distribution of the $\log_{10}$ MRE over all experiments are reported 
in Fig. 3. Results are good for both, but the normalized robustness always 
performs better. Accuracy is always greater than 0.96 and gets slightly worse when 
increasing the dimension. We report the mean of quantiles of $\rho$ and $\hat{\rho}$ for RE 
and AE for n=3 (the toughest case) in Table 1 (top two rows). Errors for the 
normalized one are also always lower and slightly worsen when increasing the 
dimension. 
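
The sign-based check described above can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric values are hypothetical, and we assume that a strictly positive robustness is read as "satisfied".

```python
def sign_accuracy(true_rob, pred_rob):
    """Fraction of formulae whose predicted robustness has the same sign as the true one."""
    agree = sum((t > 0) == (p > 0) for t, p in zip(true_rob, pred_rob))
    return agree / len(true_rob)


def relative_errors(true_rob, pred_rob):
    """Relative error |predicted - true| / |true| for each formula."""
    return [abs(p - t) / abs(t) for t, p in zip(true_rob, pred_rob)]


# Hypothetical robustness values, chosen only for illustration.
true_rob = [0.80, -0.50, 0.02, -1.20]
pred_rob = [0.75, -0.45, -0.01, -1.10]

acc = sign_accuracy(true_rob, pred_rob)
res = relative_errors(true_rob, pred_rob)

if __name__ == "__main__":
    print("accuracy:", acc)
    print("relative errors:", [round(r, 3) for r in res])
```

In this toy data the only misclassified formula is the one whose true robustness is closest to zero, and it is also the one with the largest relative error.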

In Fig. 4 (left), we plot the true standard robustness for random test formulae 
in contrast to their predicted values and the corresponding log RE. Here we 


[Figure 3: histograms for n = 1, 2, 3 (one row per dimension); left column x-axis: accuracy; right column x-axis: log MRE; legend: standard and normalized robustness with their medians.] 


Fig. 3. Accuracy of satisfiability prediction (left) and $\log_{10}$ of the MRE (right) over 
all 1000 experiments for standard and normalized robustness for samples from $\mu_0$ with 
dimensionality of signals n = 1, 2, 3. (Note the logarithmic scale, with log value of -1 
corresponding to 0.1 of the standard non-logarithmic scale.) 
Table 1. Mean of quantiles for RE and AE over all experiments for prediction of the 
standard and normalized robustness ($\rho$, $\hat{\rho}$), expected robustness ($R$, $\hat{R}$), the satisfaction 
probability ($S$) with trajectories sampled from $\mu_0$ and signals with dimensionality n=3, 
and of the normalized expected robustness on trajectories sampled from Immigration 
(1 dim), Isomerization (2 dim), and Transcription (3 dim) 


               relative error (RE)                              | absolute error (AE) 
               5perc   1quart  median  3quart  95perc  99perc   | 1quart  median  3quart  99perc 
$\rho$         0.0035  0.018   0.045   0.141   0.870   4.28     | 0.016   0.039   0.105   0.689 
$\hat{\rho}$   0.0008  0.001   0.006   0.019   0.564   2.86     | 0.004   0.012   0.039   0.286 
$R$            0.0045  0.021   0.044   0.103   0.548   2.41     | 0.013   0.029   0.070   0.527 
$\hat{R}$      0.0006  0.003   0.007   0.020   0.133   0.55     | 0.001   0.003   0.007   0.065 
$S$            0.0005  0.003   0.008   0.030   0.586   81.8     | 0.001   0.003   0.007   0.072 
imm            0.0053  0.0067  0.016   0.049   0.360   1.83     | 0.0037  0.008   0.019   0.151 
iso            0.0030  0.0092  0.026   0.091   0.569   2.74     | 0.0081  0.021   0.057   0.460 
trancr         0.0072  0.0229  0.071   0.240   1.490   7.55     | 0.018   0.049   0.12    0.680 


[Figure 4: left panel: predicted $\rho$ vs standard robustness; right panel: predicted values vs satisfaction probability.] 


Fig. 4. (left) True standard robustness vs predicted values and RE on single trajectories 
sampled from $\mu_0$. The misclassified formulae are the red crosses. (right) Satisfaction 
probability vs predicted values and RE (again for a single experiment). 


can clearly observe that the misclassified formulae (red crosses) tend to have a 
robustness close to zero, where even tiny absolute errors unavoidably produce 
large relative errors and frequent misclassification. 

We also test our method on three specifications from ARCH-COMP 2020 
[16], to show that it works well even on real formulae. The results remain good, 
with an accuracy equal to 1, median AE = 0.0229, and median RE = 0.0316 in 
the worst case (the AT1 of the Automatic Transmission (AT) Benchmark, see 
[10], Appendix D.2). 


5.3 Expected Robustness and Satisfaction Probability 


In these experiments, we approximate the expected standard robustness $R(\varphi)$, the 
normalized one $\hat{R}(\varphi)$, and the satisfaction probability $S(\varphi)$ using a fixed set of 5000 
trajectories sampled according to $\mu_0$, independent of the one used to compute the 
kernel, evaluating it for each formula in the training and test sets, and predicting 
$R(\varphi)$, $\hat{R}(\varphi)$ and $S(\varphi)$ for the test set. 

For the robustness, the mean of quantiles of RE and AE shows good results 
as can be seen in Table 1, rows 3-4. Values of MSE, MAE and MRE are smaller 
than those achieved on single trajectories with medians for n=3 equal to 0.0015, 
0.064, and 0.2 for $R(\varphi)$ and 0.00021, 0.0067, and 0.048 for $\hat{R}(\varphi)$. Normalized 
robustness continues to outperform the standard one. 

For the satisfaction probability, values of MSE and MAE errors are very low, 
with a median for n=3 equal to 0.000247 for MSE and 0.0759 for MAE. MRE 
instead is higher and equal to 3.21. The reason can be seen in Fig. 4 (right), 
where we plot the satisfaction probability vs the relative error for a random ex- 
periment. We can see that all large relative errors are concentrated on formulae 
with satisfaction probability close to zero, for which even a small absolute 
deviation can cause large relative errors. Indeed, the 95th percentile of RE is still 
fairly low, namely 0.586 (cf. Table 1, row 5), while the 99th percentile of RE 
blows up to 81.8 (at points of near-zero true probability). This heavy-tailed 
behaviour suggests relying on the median as a descriptor of typical errors, 
which is 0.008 (hence the typical relative error is less than 1%). 
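
A small numeric illustration of this effect, using hypothetical probabilities (not data from the experiments): every prediction below has the same absolute error of 0.003, yet a single formula with near-zero satisfaction probability dominates the mean relative error, while the median is unaffected.

```python
from statistics import mean, median


def relative_error(true_value, predicted):
    return abs(predicted - true_value) / abs(true_value)


# Hypothetical satisfaction probabilities; the last one is close to zero.
true_p = [0.9, 0.6, 0.3, 0.1, 0.0001]
# Every prediction is off by the same small absolute amount.
pred_p = [t + 0.003 for t in true_p]

res = [relative_error(t, p) for t, p in zip(true_p, pred_p)]

if __name__ == "__main__":
    print("relative errors:", [round(r, 4) for r in res])
    print("mean RE:  ", round(mean(res), 3))
    print("median RE:", round(median(res), 3))
```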


5.4 Kernel Regression on Other Stochastic Processes 


The last aspect that we investigate is whether the definition of our kernel w.r.t. 
the fixed measure $\mu_0$ can be used for making predictions also for other stochastic 
processes, i.e. without redefining and recomputing the kernel every time that we 
change the distribution of interest on the trajectory space. 
Fig. 5. Expected robustness prediction using the base kernel and a custom kernel. 
We depict MSE as a function of the bandwidth σ of the Gaussian kernel (with 
both axes in logarithmic scale). 

Standardization. To use the same kernel of $\mu_0$ we need to standardize the 
trajectories so that they have the same scale as our base measure. Standardization, 
by subtracting from each variable its sample mean and dividing by its sample 
standard deviation, will result in a similar range of values as that of trajectories 
sampled from $\mu_0$, thus removing distortions due to the presence of different scales 
[...] sampling algorithm. 
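
The standardization step can be sketched as follows. This is a minimal illustration, assuming that the sample mean and sample standard deviation of each variable are pooled over all time points of all sampled trajectories; the text does not spell out this detail, and the function name and data layout are hypothetical.

```python
from statistics import mean, stdev


def standardize(trajectories):
    """Standardize each variable to zero mean and unit sample standard deviation.

    trajectories: list of trajectories; each trajectory is a list of time
    points; each time point is a tuple with one value per variable.
    Assumption: statistics are pooled over all trajectories and time points.
    """
    n_vars = len(trajectories[0][0])
    pooled = [[pt[i] for traj in trajectories for pt in traj] for i in range(n_vars)]
    mus = [mean(col) for col in pooled]
    sigmas = [stdev(col) for col in pooled]
    return [
        [tuple((pt[i] - mus[i]) / sigmas[i] for i in range(n_vars)) for pt in traj]
        for traj in trajectories
    ]


# Two hypothetical 2-dimensional trajectories whose variables live on very different scales.
raw = [
    [(0.0, 100.0), (1.0, 300.0)],
    [(2.0, 200.0), (3.0, 400.0)],
]
standardized = standardize(raw)

if __name__ == "__main__":
    for traj in standardized:
        print([tuple(round(v, 3) for v in pt) for pt in traj])
```

After this step both variables have comparable ranges, so thresholds in formulae sampled for the base measure remain meaningful.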

Performance of base and custom kernel. We consider three different stochastic 
models: Immigration (1 dim), Isomerization (2 dim) and Polymerase (2 dim), 
simulated using the Python library StochPy [22] (see also [10], Appendix D.5). 
We compare the performance using the kernel evaluated according to the base 
measure $\mu_0$ (base kernel), and a custom kernel computed replacing $\mu_0$ with the 
measure on trajectories given by the stochastic model itself. Results show that 
the base kernel is still the best performing one, see Fig. 5. This can be explained 
by the fact that the measure $\mu_0$ is broad in terms of coverage of the trajectory 
space, so even if two formulae are very similar, there will be, with high 
probability, a set of trajectories for which the robustnesses of the two formulae are 
very different. This allows us to better distinguish among STL formulae, compared 
to models that tend to focus the probability mass on narrower regions of 
T, as, for example, the Isomerization model, which is the model with the most 
homogeneous trajectory space and has indeed the worst performance. 
Expected Robustness. The setting is the same as for the corresponding experiment 
on $\mu_0$. Instead of the Polymerase model, we consider here a Transcription 
model [22] (see also [10], Appendix D.5), to also have a 3-dimensional model. 
Quantiles of RE and AE for the normalized robustness are reported 
in Table 1, bottom three rows. The results on the different models are promising, 
with the Transcription model (median RE 7%) performing a bit 
worse than Immigration and Isomerization (1.6% and 2.6% median RE). Similar 
experiments have also been done on single trajectories, where we obtain results 
similar to those for the expected robustness ([10], Appendix D.5). 


6 Conclusions 


To enable any learning over formulae, their features must be defined. We 
circumvented the typically manual and dubious process by adopting a more canonical, 
infinite-dimensional feature space, relying on the quantitative semantics of STL. 
To effectively work with such a space, we defined a kernel for STL. To further 
overcome artefacts of the quantitative semantics, we proposed several 
normalizations of the kernel. Interestingly, we can use exactly the same kernel with a fixed 
base measure over trajectories across different stochastic models, not requiring 
any access to the model. We evaluated the approach on realistic biological 
models from the StochPy library as well as on realistic formulae from ARCH-COMP 
and observed good accuracy already with a few hundred training formulae. 
Even smaller training sets are possible through a wiser choice of the training 
formulae: one can incrementally pick formulae significantly different (now that 
we have a similarity measure on formulae) from those already added. Such active 
learning results in a better coverage of the formula space, allowing for a more 
parsimonious training set. Besides estimating robustness of concrete formulae, 
one can lift the technique to computing STL-based distances between stochastic 
models, given by differences of robustness over all formulae, similarly to [14]. To 
this end, it suffices to resort to a dual kernel construction, and build non-linear 
embeddings of formulae into finite-dimensional real spaces using the kernel-PCA 
techniques [27]. Our STL kernel, however, can be used for many other tasks, some 
of which we sketched in the Introduction. Finally, to further improve its properties, 
another direction for future work is to refine the quantitative semantics so that 
equivalent formulae have the same robustness, e.g. using ideas like in [23]. 
Abstract. Aggregated roundoff errors caused by floating-point arithmetic 
can make numerical code highly unreliable. Verified postconditions 
for floating-point functions can guarantee the accuracy of their results 
under specific preconditions on the function inputs, but how to systemati- 
cally find an adequate precondition for a desired error bound has not been 
explored so far. We present two novel techniques for automatically syn- 
thesizing preconditions for floating-point functions that guarantee that 
user-provided accuracy requirements are satisfied. Our evaluation on a 
standard benchmark set shows that our approaches are complementary 
and able to find accurate preconditions in reasonable time. 


1 Introduction 


Floating-point arithmetic as defined by the IEEE 754 standard [18] is widely used 
to approximate real arithmetic in embedded or scientific computing applications. 
While allowing highly efficient computations, the limited precision of floating- 
point numbers introduces roundoff errors in every single operation [24]. The 
aggregated errors in computations where such rounding happens repeatedly are 
challenging to understand and predict intuitively, so that a variety of techniques 
and tools [10,14,11,29,21,22] have been developed that bound worst-case roundoff 
errors. These techniques assume a given floating-point precision, e.g. uniform 
double precision, and a precondition $\psi(\bar{x})$ that bounds a function's possibly 
multivariate parameters ($\bar{x}$), and automatically compute an upper bound $\epsilon$ on the 
function result's absolute roundoff error ($f_{err}(\bar{x})$)⁴: 


$$\forall \bar{x}.\ \psi(\bar{x}) \rightarrow f_{err}(\bar{x}) \le \epsilon \qquad (1)$$ 


Answering the inverse question can be equally useful: given a desired round- 
off error bound and precision, for which inputs will the computation’s result be 


$ Part of this work was done while the author was at KIT, Germany. 

t Part of this work was done while the author was at MPI-SWS, Germany. 

* Part of this work was funded by the AESC project supported by the Ministry of 
Science, Research and Arts Baden-Württemberg (Ref: 33-7533.-9-10/20/1). 

4 We provide more formalization details in the next section. 
at least this accurate? That is, given a postcondition specifying the error bound 
for a floating-point function's result, we want to infer a suitable precondition $\psi$. 
Such preconditions can be useful for modular verification of larger floating-point 
programs, or for efficient implementations: for inputs that satisfy the generated 
precondition, the function can be evaluated using e.g. efficient double-precision 
floating-point arithmetic, instead of a more accurate but significantly more ex- 
pensive arbitrary-precision arithmetic [2] that would have to be used for the 
remaining input space. 

Outside the analysis of floating-point software, the automatic synthesis of 
preconditions for software components is not a new field of study. Dijkstra’s 
weakest precondition calculus [12], while not originally intended to be used for 
specification inference, can generate weakest preconditions. However, when ap- 
plied to a floating-point function, it creates a precondition that still contains the 
floating-point arithmetic of the analyzed program and is, thus, not simpler than 
the program itself. Recent approaches (targeting non-floating-point programs) 
for specification inference [23,28,7,13] similarly do not attempt to abstract from 
arithmetic operations and their inaccuracies. 

This paper introduces two novel techniques for synthesizing sound and abstract 
preconditions for floating-point functions. The inferred preconditions $\psi(\bar{x})$ 
are sound, by which we mean that they are guaranteed to satisfy Eq. (1) for a 
user-specified error bound $\epsilon$. The preconditions are abstract in the sense that 
they do not contain any floating-point arithmetic operations. 

We choose to synthesize interval-valued preconditions that bound each function 
parameter by a lower and an upper bound, i.e. $x \in [a, b]$. Such preconditions 
avoid floating-point arithmetic, and thus roundoff errors, as evaluating them re- 
quires only comparisons with constants. Our preconditions are relatively simple 
on purpose to ensure compatibility with current sound roundoff verification tech- 
niques that internally rely on interval-based abstractions. While more complex, 
e.g. nonlinear, constraints may be more precise, they are not well-supported by 
state-of-the-art verifiers and thus their benefit would be (currently) lost. 

While we aim to synthesize weak preconditions that cover much of the in- 
put space, weakest preconditions are not necessarily helpful in the context of 
floating-point computations. The reason is that the space of inputs satisfying a 
postcondition—especially one bounding the roundoff error—is in general highly 
discontinuous due to the discrete nature of floating-point arithmetic. A weakest 
precondition would thus consist of a large conjunction, with individual terms 
often covering only a few values, and would hence not be practically useful. 
Instead, we aim to find preconditions that balance precision (are as weak as 
possible) and complexity (are simple and can be evaluated efficiently). 

We are not aware of an existing approach for generating such sound floating- 
point preconditions; we thus choose to introduce and explore two quite different 
techniques that build on existing dynamic and static floating-point analyses in 
a novel way. Both approaches start by dynamically sampling the analyzed func- 
tion in order to find likely precondition candidates and then use a verification 
backend to refine them until their soundness can be guaranteed. The first, recursive 
subdivision approach does this by recursively subdividing the input space 
into increasingly smaller cells, discarding those where sampling shows that the 
postcondition is not satisfied for the contained inputs, and attempting to verify 
the rest. Since such generated preconditions may still contain a large number 
of discontinuous subdomains, we further present an optimization algorithm that 
soundly approximates the preconditions with significantly simpler expressions 
that can be evaluated more efficiently. The second classification tree approach 
learns areas of inputs for which the postcondition holds based on a classification 
tree learned from the dynamic samples, and iteratively refines verified precon- 
ditions in these areas. 

Our approaches guarantee soundness of the generated preconditions by veri- 
fying each individual interval domain in the preconditions using a sound floating- 
point roundoff error analyzer. Our approach is generic in the choice of this tool; 
we integrate the floating-point verification framework Daisy [10]. 

We evaluate and compare our proposed approaches on benchmarks from 
the standard floating-point benchmark suite FPBench [8] and show that the ap- 
proaches are able to find adequate preconditions that (1) are syntactically simple 
and cheap to evaluate and (2) are relatively weak, i.e. good approximations of 
the weakest preconditions covering large areas of the input space, thus balancing 
complexity and permissiveness. For most benchmarks, our approaches find pre- 
conditions in under 20 minutes (and often significantly faster). We demonstrate 
a possible application of our inferred preconditions for performance improve- 
ments on a case study using a kernel from a real-world material sciences code 
that inspired this work. 


Contributions. In summary, this paper makes the following contributions: 


— Two independent novel inference algorithms that generate interval-valued 
preconditions for floating-point functions. They are the first of their kind. 

— An open-source implementation of both approaches as part of the Daisy 
floating-point analysis framework. 

— An extensive evaluation on 99 benchmarks and a case study showing the 
effectiveness of our precondition inference. 


2 Overview 


Before explaining our approaches in detail, we provide a high-level overview 
using an example. Consider the two-dimensional function himmilbeau from the 
floating-point benchmark suite FPBench [8], introduced to evaluate optimization 
algorithms [16], and defined as 


$$f(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2 .$$ 


We denote by $f : \mathbb{R}^n \to \mathbb{R}$ the ideal, real-valued specification of the function 
that a developer may want to compute (where n is the number of function 
arguments, n = 2 for our example). While such a function can in principle be 
implemented exactly, e.g. using rational arithmetic, such an evaluation is generally 
slow. Hence, in practice, the function would be implemented in finite precision. 
In this paper, we consider double-precision (64 bit) IEEE 754 [18] floating-point 
arithmetic, which is one of the most commonly used finite precisions (though 
our approach generalizes to other floating-point precisions as well). We denote 
this finite-precision implementation by $\tilde{f} : \mathbb{F}^n \to \mathbb{F}$. 

When evaluating $\tilde{f}$, each computed intermediate value potentially has to be 
rounded to a value that is representable in finite precision, introducing a roundoff 
error. While each roundoff error individually is (usually) small, the errors 
propagate and accumulate during the computation, resulting in potentially large errors 
on a function's result [20]. It is thus important to be able to make statements 
about this error, for instance as an absolute error: $f_{err}(\bar{x}) = |f(\bar{x}) - \tilde{f}(\bar{x})|$, $\bar{x} \in \mathbb{F}^n$, 
where we assume that $\bar{x}$ are 'finite' values and not one of the Not-a-Number or 
Infinity special floating-point values. Our approach assumes and proves that 
all computations remain within the number ranges of the chosen floating-point 
precision and that special values never occur during expression evaluation. 
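
To make the definition of the absolute error concrete, the following Python sketch evaluates the running example once in double precision and once in exact rational arithmetic, and returns the difference. The order of operations in `himmilbeau` and the helper name `roundoff_error` are assumptions made for illustration. Python's `fractions.Fraction` stands in for the real-valued specification, which is possible here because the expression only uses addition, subtraction and multiplication.

```python
from fractions import Fraction


def himmilbeau(x1, x2):
    # Works for floats (rounded after every operation) and for Fractions (exact).
    a = x1 * x1 + x2 - 11
    b = x1 + x2 * x2 - 7
    return a * a + b * b


def roundoff_error(x1: float, x2: float) -> Fraction:
    """Exact absolute error of the double-precision evaluation at (x1, x2)."""
    finite = himmilbeau(x1, x2)                     # double-precision result
    ideal = himmilbeau(Fraction(x1), Fraction(x2))  # exact result for the same inputs
    return abs(Fraction(finite) - ideal)


if __name__ == "__main__":
    print(float(roundoff_error(0.1, 0.2)))
    # At (3, 2) all intermediate values are small integers, so no rounding occurs.
    print(roundoff_error(3.0, 2.0) == 0)
```

Note that `Fraction(x1)` converts the double exactly, so the error measured here is the error of the arithmetic only, for inputs that are already floating-point values.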

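In this paper, we aim to synthesize an interval-valued precondition $\psi(\bar{x})$ that 
satisfies Eq. (1) ($\forall \bar{x}.\ \psi(\bar{x}) \rightarrow f_{err}(\bar{x}) \le \epsilon$) where $\psi$ is of the form: 


$$\bigvee_{k=1}^{m} \bigwedge_{i=1}^{n} x_i \in [a_{k,i}, b_{k,i}]$$ 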


I.e. such a precondition represents the (set-theoretic) union of m domains of 
dimension n. To obtain a precondition that can be efficiently checked, we aim 
to keep m small (< 10), while the precondition should nonetheless be as weak 
as possible, i.e. cover as much of the input space as possible. 
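
Checking such a precondition at run time requires only comparisons with constants. A minimal Python sketch, in which the function name and the two boxes are hypothetical and have not been verified by any tool:

```python
from typing import List, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (lower, upper) pair per parameter


def satisfies(precondition: List[Box], x: Sequence[float]) -> bool:
    """True if x lies in at least one box (disjunction over k, conjunction over i)."""
    return any(
        all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box))
        for box in precondition
    )


# Hypothetical precondition with m = 2 boxes over n = 2 parameters.
psi = [
    [(-4.0, -2.0), (1.0, 3.0)],
    [(2.0, 4.0), (0.0, 2.5)],
]

if __name__ == "__main__":
    print(satisfies(psi, (3.0, 1.0)))   # inside the second box
    print(satisfies(psi, (0.0, 0.0)))   # outside both boxes
```

The cost of the check grows linearly with m, which is one reason for keeping the number of boxes small.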

Our precondition inference starts from an initial search area which may be 
either specified by the user, be defined, for example, by an embedded sensor 
output domain, or be computed by a static analysis on the call site(s) of f. For 
our himmilbeau example, we assume $x_1, x_2 \in [-20, 20]$ as the search area, and 
$\epsilon$ = 1.4211e-12 as the target error bound. 

In the first step, our approach samples inputs from the initial search area at 
random, evaluates the function $\tilde{f}$ on each input in double-precision arithmetic, 
and approximates its corresponding specification $f$ using 128 bit arbitrary-precision 
arithmetic [2]. Comparing the results from the double- and higher-precision 
evaluations gives us an estimate of the roundoff error. We use this 
estimate to mark each input as valid or invalid, i.e. as satisfying or violating 
the postcondition, respectively. Fig. 1 shows the valid and invalid samples for 
our running example in blue and red, respectively. Note that the error bounds 
obtained from these samples do not have to be sound, as they are used only for 
guiding the precondition search; our technique will use static analysis to verify 
each precondition candidate soundly. Furthermore, the sampling also does not 
need to identify the exact bounds between valid and invalid samples. As Fig. 1 
indicates, such bounds would lead to highly discontinuous preconditions that 
would be of limited practical use. 
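
The sampling step can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the paper: exact rational arithmetic (`fractions.Fraction`) replaces the 128 bit arbitrary-precision reference, the operation order of `himmilbeau` is assumed, and the names `estimated_error` and `sample_and_label` as well as the seed are hypothetical.

```python
import random
from fractions import Fraction


def himmilbeau(x1, x2):
    a = x1 * x1 + x2 - 11
    b = x1 + x2 * x2 - 7
    return a * a + b * b


def estimated_error(x1: float, x2: float) -> Fraction:
    # Double-precision result compared against an exact rational reference.
    return abs(Fraction(himmilbeau(x1, x2)) - himmilbeau(Fraction(x1), Fraction(x2)))


def sample_and_label(n_samples, lo, hi, eps, seed=0):
    """Draw uniform random inputs and mark each as valid (error <= eps) or invalid."""
    rng = random.Random(seed)
    labelled = []
    for _ in range(n_samples):
        x = (rng.uniform(lo, hi), rng.uniform(lo, hi))
        labelled.append((x, estimated_error(*x) <= eps))
    return labelled


EPS = Fraction("1.4211e-12")  # target error bound of the running example
samples = sample_and_label(1000, -20.0, 20.0, EPS)

if __name__ == "__main__":
    n_valid = sum(1 for _, ok in samples if ok)
    print(f"{n_valid} of {len(samples)} samples satisfy the postcondition")
```

The labels only guide the search: a region whose samples are all valid is a candidate, and soundness still has to be established by the static analysis.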
Fig. 1: The sampled himmilbeau function. Blue and red points indicate valid and 
invalid input values, respectively. The rectangles show the inferred preconditions. 


Starting from these samples, we explore two techniques. First, we use inter- 
val subdivision to subdivide the initial search area into equal interval regions 
(domains such that every dimension is bounded by an interval), and then check 
each region individually using sound static analysis for whether it is a valid part 
of the precondition. Fig. 1a shows the generated precondition in green. To reduce 
the number of regions in the precondition for a simpler and more efficient pre- 
condition, we propose an optimization algorithm that approximates the initial 
verified precondition with fewer, larger regions; the result of this optimization is 
shown in Fig. 1b. 

Subdivision may be inefficient when only a small part of the initial search 
area constitutes a valid precondition. We thus further explore an approach based 
on classification tree learning that starts from the valid and invalid samples and 
learns an initial candidate precondition, or a set of candidates if the space of 
valid samples is disjoint. Then, we again use static error verification to search 
for sound preconditions. Fig. 2a shows the generated precondition in green. 

Ultimately, an inferred precondition allows us to refactor floating-point 
programs such that they use computations in floats if the result is known to be 
accurate, and resort to high-precision libraries otherwise. For example, a C 
implementation of the himmilbeau example using the precondition from Fig. 1b 
achieves an 8.6% speed-up over a pure high-precision implementation (on 
randomly chosen inputs from the range [-20, 20]). The precondition that triggers 
the optimization covers 11.5% of the input domain, hence the size of a 
precondition translates nearly directly into performance improvements. 

Fig. 2: Inferred preconditions for himmilbeau using the classification tree 
approach for the error postcondition, and subdivision for the range postcondition. 


The inferred precondition will in general be stronger than the weakest 
possible precondition, i.e. our inferred preconditions do not cover all of the blue 
points in Fig. 1 and Fig. 2. There are several reasons: The verification backend 
has to rely on abstractions and thus cannot always verify a valid precondition 
candidate. Furthermore, due to runtime considerations of our algorithm, the 
approaches cannot operate on arbitrarily detailed intervals. 

Finally, while we discussed our precondition inference for postconditions that 
target an error bound, our approach equally works for postconditions that specify 
a target range, e.g. that require that the value of the result of our himmilbeau 
function is within given bounds ($f(\bar{x}) \in [-100, 100]$). We show the precondition 
inferred for this case using subdivision and subsequent optimization in Fig. 2b. 


3 Precondition Inference by Subdivision 


The first approach that we propose finds preconditions by recursively splitting 
the initial search area along the parameter axes until it finds interval domains 
for which the verification backend is able to prove that the target postcondition 
holds for all inputs. This approach is inspired by interval subdivision that is 
being used, for example, in roundoff error bound analysis to reduce the amount 
of over-approximations due to abstractions. 

However, a naive application of subdivision for precondition inference is not 
practical. Each parameter’s interval has to be subdivided several times in order 
to find verifiable preconditions, leading to a large number of regions especially 
for multi-variate functions. If we then run the relatively expensive verification 
procedure on each of these regions, the overall running time quickly becomes 
unreasonable. Furthermore, a precondition consisting of a large number of small 
interval regions is inefficient to evaluate and unwieldy. We thus combine static 
and dynamic verification (Sec. 3.1), and optimize the generated preconditions to 
yield more compact representations (Sec. 3.2). 
Fig. 3: Illustration of recursive subdivision in two dimensions. 


Algorithm 1 Recursive Subdivision 

given arithmetic expression expr, postcondition post 
procedure EXTRACTPRE(node) 
    if node ∈ valid then 
        if verify(node.region, expr, post) then 
            return node.region 
    if node is a leaf then return ∅ 
    else return EXTRACTPRE(node.left) ∪ EXTRACTPRE(node.right) 


3.1 Extracting a Verified Precondition from Subdivisions 


Our approach starts by building a binary tree, where each node represents an 
interval region in the search area. The tree is generated by recursively splitting 
intervals along one parameter axis into two equally sized intervals (called left 
and right), splitting along each parameter axis in turn. The top part of Fig. 3 
illustrates this subdivision for a two-dimensional example and with a maximum 
subdivision depth of 4. From left to right, the nodes are repeatedly subdivided 
until there are 16 leaf nodes. 

Our algorithm then runs dynamic sampling (as described in Sec. 2) for each 
leaf node l. A node l is marked as valid (blue check marks in Fig. 3) if the 
postcondition is satisfied for all samples, and as invalid (red cross marks) otherwise. 
The middle part of Fig. 3 shows how these markers ascend to the root of the 
tree: An inner node i is marked valid if and only if both of its children are valid: 
$i \in valid \Leftrightarrow (i.left \in valid \wedge i.right \in valid)$. 
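
The tree construction and the bottom-up marking can be sketched in Python as follows. The class and function names are hypothetical, and `toy_leaf_is_valid` is a stand-in for the dynamic sampling of a leaf region; it is not the paper's sampling procedure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Region = Tuple[Tuple[float, float], ...]  # one (lower, upper) pair per parameter


@dataclass
class Node:
    region: Region
    valid: bool = False
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(region: Region, max_depth: int,
          leaf_is_valid: Callable[[Region], bool], depth: int = 0) -> Node:
    node = Node(region)
    if depth == max_depth:
        node.valid = leaf_is_valid(region)  # leaves are marked by sampling
        return node
    axis = depth % len(region)              # split along each parameter axis in turn
    lo, hi = region[axis]
    mid = (lo + hi) / 2
    left_region = region[:axis] + ((lo, mid),) + region[axis + 1:]
    right_region = region[:axis] + ((mid, hi),) + region[axis + 1:]
    node.left = build(left_region, max_depth, leaf_is_valid, depth + 1)
    node.right = build(right_region, max_depth, leaf_is_valid, depth + 1)
    # An inner node is valid if and only if both children are valid.
    node.valid = node.left.valid and node.right.valid
    return node


def count_leaves(node: Node) -> int:
    if node.left is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


def toy_leaf_is_valid(region: Region) -> bool:
    # Stand-in oracle: pretend the postcondition holds exactly where x1 <= 0.
    return region[0][1] <= 0.0


root = build(((-1.0, 1.0), (-1.0, 1.0)), 4, toy_leaf_is_valid)

if __name__ == "__main__":
    print(count_leaves(root), root.valid, root.left.valid, root.right.valid)
```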

Fig. 4: Approximating generated preconditions. 


Next, our approach performs a recursive descent (shown in Algorithm 1) from 
the root node to extract the precondition. The verification backend is queried 
(verify in the algorithm) to verify that intervals are valid (sound) preconditions 
for all inputs in a given region. As a heuristic, verification is attempted as close 
to the root of the tree as possible, since a single verification attempt can then 
verify a larger volume. On the other hand, the verification is more likely to fail, which 
may increase the running time of the algorithm. Verification is futile and thus not 
attempted for an invalid node (node ∉ valid). In this case, or if the verification 
backend fails to verify, the procedure descends further down the tree. 
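
A direct Python transcription of this recursive descent is given below as a sketch. The verifier is passed in as a function; `toy_verify` is a hypothetical stand-in that succeeds only on sufficiently narrow regions, mimicking a sound analysis whose over-approximation is too coarse on large regions. It does not call Daisy or any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Region = Tuple[Tuple[float, float], ...]


@dataclass
class Node:
    region: Region
    valid: bool
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def extract_pre(node: Node, verify: Callable[[Region], bool]) -> List[Region]:
    # Verification is only attempted on nodes marked valid by sampling.
    if node.valid and verify(node.region):
        return [node.region]
    if node.left is None:  # leaf that is invalid or could not be verified
        return []
    return extract_pre(node.left, verify) + extract_pre(node.right, verify)


attempts: List[Region] = []


def toy_verify(region: Region) -> bool:
    attempts.append(region)  # record every call to the (expensive) verifier
    return all(hi - lo <= 1.0 for lo, hi in region)


def leaf(lo: float, hi: float, valid: bool) -> Node:
    return Node(((lo, hi),), valid)


# Hand-built one-dimensional tree with validity marks already propagated.
tree = Node(((0.0, 4.0),), False,
            Node(((0.0, 2.0),), True, leaf(0.0, 1.0, True), leaf(1.0, 2.0, True)),
            Node(((2.0, 4.0),), False, leaf(2.0, 3.0, True), leaf(3.0, 4.0, False)))

precondition = extract_pre(tree, toy_verify)

if __name__ == "__main__":
    print(precondition)
    print(len(attempts), "verification attempts")
```

In this toy run the verifier is never called on the root, on the invalid inner node, or on the invalid leaf, and the failed attempt on [0, 2] is recovered by verifying its two halves.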

The bottom part of Fig. 3 illustrates this procedure. No verification is 
attempted on the root node and its direct children as they are invalid. 
Verification is attempted for the two valid grandchild nodes of the root that were 
marked with a blue check mark. For the lower right node verification is 
successful, so there is no need to further descend to its child nodes. Verification fails for 
the left one, which means it has to be subdivided again, like its two remaining 
sibling nodes. Sometimes subdivision is needed to verify a region even if all of it 
is ultimately verifiable, such as the lower left region in the last subdivision step. 
The reason for this is that subdivision generally reduces over-approximations 
due to the abstractions that the sound verification procedure relies on, and thus 
often allows computing tighter error bounds [10]. 

The maximum subdivision depth controls the precision of the approach. 
With larger depth, the generated preconditions can have a larger volume, i.e. be 
weaker, but this comes at the cost of a longer running time of the algorithm. 

The union of all valid regions extracted from the tree is returned as a pre- 
condition. This precondition is sound, since each region has been verified by a 
sound roundoff error analysis. 


3.2 Precondition Optimization 


Depending on the subdivision depth, the number of individual regions in a gen- 
erated precondition can easily reach into the thousands. We observed that one 
can often approximate the result with significantly fewer regions, while only 
marginally reducing their volume. Fig. 4 shows an example precondition gen- 
erated by subdivision on the left, and the optimized precondition on the right. 
The precondition on the right needs only two regions instead of 8, and covers 
most of the originally generated precondition and is thus only slightly stronger. 

Fig. 5: A sample classification tree and the extracted precondition candidate 


Note that simply picking the largest individual interval regions from the 
generated precondition is in general insufficient: larger regions may be found within 
the verified area by composing parts of different intervals into larger ones. While 
one could in principle use simplification algorithms inside constraint solvers for 
this task, such algorithms do not target our use case, i.e. the smallest formula 
that covers the biggest valid region. 

Thus, we propose an optimization that starts by identifying the interval 
region that covers the largest verified area and that possibly (partially) covers 
several interval regions from the originally generated precondition. It then 
iteratively repeats this process and keeps adding regions that provide the most 
additional coverage. Since our algorithm is greedy, it is not guaranteed to find 
an optimal solution, but our experiments have shown that the approximation is 
good even for a small number of regions. Since only regions 
covering verified areas are added, the optimized precondition is sound. This 
optimization is also fast compared to the rest of the procedure, since it does not 
run roundoff verification. 
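
The greedy idea can be illustrated on a simplified, discretized setting. In the sketch below the verified precondition is given as a two-dimensional grid of equally sized cells, which is an assumption made for brevity; the candidate rectangles are enumerated by brute force, and all names are hypothetical. It is not the optimization algorithm as implemented in the paper.

```python
from typing import List, Set, Tuple

Rect = Tuple[int, int, int, int]  # (first_row, last_row, first_col, last_col), inclusive


def cells(rect: Rect) -> Set[Tuple[int, int]]:
    r0, r1, c0, c1 = rect
    return {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def greedy_cover(verified: List[List[bool]], max_regions: int):
    rows, cols = len(verified), len(verified[0])
    verified_cells = {(r, c) for r in range(rows) for c in range(cols) if verified[r][c]}
    # Only rectangles that lie entirely inside the verified area are candidates,
    # so any union of chosen rectangles remains sound.
    candidates = [
        (r0, r1, c0, c1)
        for r0 in range(rows) for r1 in range(r0, rows)
        for c0 in range(cols) for c1 in range(c0, cols)
        if cells((r0, r1, c0, c1)) <= verified_cells
    ]
    chosen: List[Rect] = []
    covered: Set[Tuple[int, int]] = set()
    for _ in range(max_regions):
        # Pick the rectangle that adds the most not-yet-covered verified cells.
        best = max(candidates, key=lambda rect: len(cells(rect) - covered), default=None)
        if best is None or not cells(best) - covered:
            break
        chosen.append(best)
        covered |= cells(best)
    return chosen, covered


grid = [[bool(v) for v in row] for row in (
    (1, 1, 1, 0),
    (1, 1, 1, 0),
    (0, 1, 1, 1),
    (0, 0, 0, 1),
)]
chosen, covered = greedy_cover(grid, max_regions=2)

if __name__ == "__main__":
    print(chosen)
    print(len(covered), "of", sum(map(sum, grid)), "verified cells covered")
```

With two regions, nine of the ten verified cells are covered in this toy grid, so the simplified precondition is only slightly stronger than the original one.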

This precondition optimization step can be applied to preconditions obtained 
from both inference approaches (recursive subdivision and the refinement 
approach from the upcoming section), but the effects are more pronounced for the 
subdivision approach as it usually produces results with more individual regions. 


4 Precondition Inference by Decision Tree Learning 


Our second precondition inference technique leverages the dynamic samples in 
a different way: it uses them to generate initial precondition candidates using 
decision tree learning [4], a well-known algorithm in supervised machine learning. 
These candidates are subsequently refined to obtain sound preconditions. We 
consider two such refinements in Sec. 4.2 and Sec. 4.3. 


4.1 Extracting Candidates from a Classification Tree 


First, our algorithm samples the search area as described in Sec. 2, and marks
each sample as valid or invalid depending on whether or not it satisfies the
postcondition. The marked or ‘classified’ samples serve as the training data to
train a classification tree (CT) using decision tree learning. A CT is a binary
tree in which the inner nodes are tests on the data and each leaf is labeled with
a category. To classify an individual input, one follows the path given by the
tests in the CT and obtains the label of the reached leaf as an answer.

Fig. 6: Illustration of a single candidate (blue rectangle) refinement

We use CTs to find a simple classification that separates the valid from the 
invalid samples. Fig. 5a shows such a CT for our example himmilbeau function. 
Note that all tests in the CT are comparisons between a variable and a con- 
stant. From this CT, we can extract representations for the category valid by 
enumerating all paths from the root to valid leaves and collecting (i.e., conjoining) all
conditions (or their negations for negative edges). Due to the choice of simple
comparisons with constants for tests, the result can be expressed as bounds on 
the input variables, which describes a set of interval regions. Fig. 5b shows the 
(simplified) precondition candidates extracted from Fig. 5a. 
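
The path enumeration can be sketched as follows. This is a minimal illustration on a hand-built tree, not the Smile data structure and not the tree of Fig. 5a; the `Node` class, the test form `var <= threshold`, and all thresholds are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Bounds = Dict[str, Tuple[float, float]]  # variable name -> (lower, upper)


@dataclass
class Node:
    # Inner node: tests `var <= threshold`. Leaf: `label` is set instead.
    var: Optional[str] = None
    threshold: float = 0.0
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None
    label: Optional[str] = None


def valid_regions(node: Node, bounds: Bounds) -> List[Bounds]:
    """Collect one interval region per root-to-leaf path ending in 'valid'."""
    if node.label is not None:
        return [dict(bounds)] if node.label == "valid" else []
    lo, hi = bounds[node.var]
    regions: List[Bounds] = []
    if lo <= node.threshold:  # positive edge: tighten the upper bound
        tightened = {**bounds, node.var: (lo, min(hi, node.threshold))}
        regions += valid_regions(node.if_true, tightened)
    if hi > node.threshold:   # negative edge: tighten the lower bound
        tightened = {**bounds, node.var: (max(lo, node.threshold), hi)}
        regions += valid_regions(node.if_false, tightened)
    return regions


# Hypothetical tree: valid iff x <= 2 and y <= 1.
tree = Node(var="x", threshold=2.0,
            if_true=Node(var="y", threshold=1.0,
                         if_true=Node(label="valid"),
                         if_false=Node(label="invalid")),
            if_false=Node(label="invalid"))
search_area = {"x": (-5.0, 5.0), "y": (-5.0, 5.0)}
print(valid_regions(tree, search_area))
# [{'x': (-5.0, 2.0), 'y': (-5.0, 1.0)}]
```

The negative edge uses a closed lower bound at the threshold, which slightly over-approximates the strict comparison; this is harmless here because candidates are refined before they are reported.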


4.2 Refining Candidates by Growing Regions 


Heuristics are applied when training CTs, and the classification has only been 
obtained from a small set of random samples. It is hence very likely that the
candidates still contain inputs for which the desired postcondition does not hold. 
They need to be processed to obtain valid preconditions. 

Fig. 6 illustrates our first candidate refinement process. The outer blue square 
represents the initial candidate. The verification backend is used to identify 
regions within it that verifiably are preconditions, shown as filled green rectangles 
in the figure. First, a small initial region in the center of the candidate is grown as 
much as possible without losing verifiability (Fig. 6a). When the maximal region 
has been found, additional precondition regions are inferred along the boundary 
of the region (Fig. 6b). To this end, extension candidates (two examples are 
shown as red rectangles) are identified as the largest possible regions to add 
in particular directions. The mentioned growing mechanism infers maximum 
regions within the extension candidates. For every added region, the extension 
process is repeated (Fig. 6c) until a maximum refinement depth has been reached. 

Algorithm 2 shows the pseudocode procedure REFINECANDIDATE returning
a verified precondition for a candidate region. The algorithm keeps a set M of
extension candidates and searches for the largest verifiable region inside each
extension candidate using BINSCALESEARCH (binary search on interval regions),
which invokes the verification backend. The procedure CENTER computes the
center of a region used as the starting point for growing an initial solution, and
GENEXTENSIONCANDIDATES produces new extension candidates (in the form of
min/max pairs of regions) to be explored.


Algorithm 2 Candidate Refinement

given arithmetic expression expr, postcondition post, binary search depth d
procedure REFINECANDIDATE(region)
    result ← ∅
    M ← {(CENTER(region), region)}
    while M ≠ ∅ do
        choose (min, max) ∈ M and remove it from M
        verified ← BINSCALESEARCH(min, max)
        if verified ≠ ∅ then
            result ← result ∪ {verified}
            M ← M ∪ GENEXTENSIONCANDIDATES(verified, max)
    return result

In the implementation, the set M is realized as a priority queue favoring 
potential additions far from the original candidate’s border that can thus grow 
easily, and the number of iterations is bounded by a configurable parameter. 
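
A minimal executable sketch of this refinement loop is given below. Several parts are assumptions rather than Daisy's implementation: the backend is abstracted as a user-supplied predicate `verify(box)`, BINSCALESEARCH is realized as bisection on an interpolation factor between the min and max regions, extension candidates grow from the center of each face of the verified box, and a plain FIFO queue replaces the priority queue.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (lo, hi) pair per input variable


def center(region: Box) -> Box:
    """Degenerate box at the midpoint of `region`."""
    return tuple(((lo + hi) / 2, (lo + hi) / 2) for lo, hi in region)


def interpolate(small: Box, large: Box, s: float) -> Box:
    """Box between `small` (s = 0) and `large` (s = 1)."""
    return tuple((a + s * (c - a), b + s * (d - b))
                 for (a, b), (c, d) in zip(small, large))


def bin_scale_search(small: Box, large: Box,
                     verify: Callable[[Box], bool], depth: int) -> Optional[Box]:
    """Largest verified interpolation found in `depth` bisection steps, else None."""
    if verify(large):
        return large
    best, lo, hi = None, 0.0, 1.0
    for _ in range(depth):
        mid = (lo + hi) / 2
        cand = interpolate(small, large, mid)
        if verify(cand):
            best, lo = cand, mid
        else:
            hi = mid
    return best


def gen_extension_candidates(verified: Box, outer: Box) -> List[Tuple[Box, Box]]:
    """One (min, max) pair per face of `verified` that has room left in `outer`."""
    mids = center(verified)
    pairs = []
    for i, ((v_lo, v_hi), (o_lo, o_hi)) in enumerate(zip(verified, outer)):
        for face, slab in (((v_hi, v_hi), (v_hi, o_hi)),
                           ((v_lo, v_lo), (o_lo, v_lo))):
            if slab[0] < slab[1]:
                small = mids[:i] + (face,) + mids[i + 1:]
                large = outer[:i] + (slab,) + outer[i + 1:]
                pairs.append((small, large))
    return pairs


def refine_candidate(region: Box, verify: Callable[[Box], bool],
                     depth: int = 10, max_iters: int = 50) -> List[Box]:
    result: List[Box] = []
    work = deque([(center(region), region)])
    while work and max_iters > 0:   # iteration bound, as in the text
        max_iters -= 1
        small, large = work.popleft()
        verified = bin_scale_search(small, large, verify, depth)
        if verified is not None:
            result.append(verified)
            work.extend(gen_extension_candidates(verified, large))
    return result
```

The bisection is only meaningful if verification is monotone under box inclusion (a box verifies whenever a larger enclosing box does), which is the usual behavior of interval-based analyses. The returned boxes may overlap; their union is the refined precondition.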


4.3 Refining Candidates by Recursive Subdivision 


Instead of this refinement approach for precondition candidates, the subdivision
technique from Sec. 3 can also be applied to obtain valid preconditions
from candidates. The candidate production using a CT then serves as a first
step narrowing an initial search region to a smaller region in which subdivision 
can operate productively, in particular because a finer mesh can be applied on 
the interesting regions, which is better for verification with the backend verifier. 


5 Evaluation 


Implementation We implemented both precondition inference approaches in the 
open-source tool Daisy [10], building on the static range and error analyses that 
Daisy provides. In particular, we use Daisy’s interval analysis for computing real- 
valued ranges and affine arithmetic for computing roundoff error bounds. We use 
the DecisionTree class from the Smile library [1] for classification tree learning. 
Empirically, we have identified the following default parameters that produce 
good results on the benchmarks on average, while not being prohibitive for larger 
benchmarks: we limit the maximum depth for classification tree learning to 8 and 
the depth for binary search during refinement in the classification tree approach 
to 10. When combining classification tree learning with subdivision, we limit the
decision tree depth to 12. We use 8192 samples for classification tree learning
and 16 samples per subdivided region for our subdivision approach. 


Benchmarks We evaluate our precondition inference approaches on benchmarks
from the benchmark suite FPBench [8] that is widely used in the floating-point 
research community. Each benchmark consists of an arithmetic expression and 
typically comes with a precondition specifying the input domain of the expres- 
sion. For a few benchmarks where no input domain is given, we add one manually. 
For our evaluation, we require a postcondition to be given that specifies a target 
error bound or a range. Since these are not provided by FPBench as-is, we gen- 
erate them for our experiments as follows. We compute error bounds and result 
ranges based on the existing original input domains as specified in FPBench, 
and use these as two separate target postconditions. We exclude benchmarks for 
which Daisy is not able to compute errors or ranges, e.g. because they contain 
conditional statements. In total, we generate a set of 99 benchmarks with post- 
conditions specifying an error bound, and a separate set of 99 benchmarks with 
postconditions specifying a target range, with the following dimensionalities: 


dimension       1   2   3   4   6   8   9
# benchmarks   33  29  16   4  12   1   4


Baseline In the absence of existing tools for floating-point precondition inference 
or the ground truth⁵, we compare the preconditions inferred by our approaches
against the original preconditions specified in FPBench. Indeed, the original 
precondition from FPBench is—by construction—a valid precondition. 

We measure the quality of an inferred precondition as a relative volume, i.e. 
the ratio of the volume of the generated precondition over the volume of the orig- 
inal precondition. A relative volume greater than one is obtained if the original 
domain specification is strong and the approaches discover valid preconditions 
beyond the original specification. For many benchmarks, however, obtaining a 
relative volume close to one is close to the optimal result. (Measuring the abso- 
lute volumes is not meaningful as they are highly benchmark dependent.) 
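
As a small worked illustration of this metric, the relative volume of a precondition made of interval regions can be computed as below. The sketch assumes the inferred regions are pairwise disjoint (up to shared boundaries), so their volumes can simply be summed; the function names are hypothetical.

```python
from math import prod
from typing import Sequence, Tuple

Box = Sequence[Tuple[float, float]]  # one (lo, hi) pair per input variable


def volume(box: Box) -> float:
    """Volume of an axis-aligned box."""
    return prod(hi - lo for lo, hi in box)


def relative_volume(inferred: Sequence[Box], original: Box) -> float:
    """Volume of the inferred regions divided by the volume of the original domain.

    Assumes the inferred regions do not overlap.
    """
    return sum(volume(b) for b in inferred) / volume(original)


original = [(0.0, 1.0), (0.0, 1.0)]
inferred = [[(0.0, 1.0), (0.0, 1.0)],   # recovers the original domain
            [(1.0, 2.0), (0.0, 0.5)]]   # plus a verified region beyond it
print(relative_volume(inferred, original))  # 1.5
```

A value above one, as in this example, corresponds to the case where valid inputs are found beyond the original specification.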


Setup Our techniques rely on an initial search area provided by the user. While 
it may be convenient if our algorithms considered an unbounded initial space, 
i.e. all possible floating-point values, this is practically infeasible. The valid pre- 
condition typically covers only a very small part of this ‘unbounded’ domain, 
and it would thus be computationally very expensive to search for. 

For our evaluation, we consider two sets of initial search areas: We use the 
original domain specified in FPBench scaled uniformly around their centers to 
contain 100 times the original volume, and we use a large fixed initial domain 
for all benchmarks bounding all input arguments in [—10°, 10°]. For both initial 
search areas, it is unlikely that the entire area would be a valid precondition. 


Comparison of Approaches Simply comparing the relative volume of the
preconditions does not consider that each approach would be able to produce bigger
preconditions by investing more computational effort. Conversely, the running
times cannot be compared in isolation. Thus, we compare the relative volumes
of generated preconditions per invested time⁶. We use a timeout of 20 minutes
for each benchmark and parameter setting.


5 The exact ground truth would be highly discontinuous, and would require sampling
of all floating-point inputs, which is infeasible for double precision.

Fig. 7: Summary statistics precondition optimization

We effectively consider three approaches: subdivision, tree refinement
(with growing candidates), and tree refinement with subdivision, which we call
hybrid for the sake of this evaluation. For this comparison, we initially do not
use the precondition optimization from Sec. 3.2, and evaluate it separately. We 
observe that for the subdivision and the hybrid approach, the maximum depth 
of the subdivision tree significantly affects the running time of the algorithm. For 
the tree refinement, the most relevant parameter is the number of refinement can- 
didates considered for the growing-based refinement. We thus vary these parame- 
ters and keep all others to the default values given in Sec. 5. In total, we run 3762 
experiments using the scaled and 1782 experiments using the fixed search area. 

Fig. 7 summarizes our results. ‘TO’ counts the number of times an individual 
run timed out. ‘Fail’ means that no precondition was found by a search strategy 
for any of the tested parameters. ‘Best’ counts the number of benchmarks for 
which an approach was able to find the best (weakest) precondition (with any 
parameter setting); when the numbers do not add up to 99, it is due to ties. 

Clearly, our precondition inference is more effective for the scaled search area 
benchmarks; it is able to find preconditions in nearly all runs. However, it is 
able to find some preconditions even for the very large area, where the verifiable 
regions are often vanishingly small. Also, we observe that no one approach is 
universally better than the others, as each is best on some set of benchmarks. 

Fig. 9 visualizes the relative volumes of generated preconditions by the
different approaches per running time of the algorithm, for benchmarks where the
postconditions bound the roundoff error and for the scaled input search
areas. Each point corresponds to one parameter setting. Fig. 9a averages over all
benchmarks, whereas Fig. 9b averages only over benchmarks where the generated
precondition was small, i.e. at most 1.2 times the original precondition. We
show the analogous plots for the range postconditions in Fig. 10.


6 We ran all experiments on a Mac mini with a 6-core Intel i5 processor at 3 GHz
with 16 GB RAM running macOS Catalina.

Fig. 9: Comparison of approaches without optimization: average relative volume
per time (seconds) for error postconditions and 100x search area.

Fig. 10: Comparison of approaches without optimization: average relative volume
per time (seconds) for range postconditions and 100x search area.


We observe that averaged over all benchmarks, the subdivision and hybrid 
approaches perform significantly better than the tree refinement approach. In 
fact, our techniques are able to identify preconditions that are, on average,
significantly larger than the original precondition. If we consider only those 33
benchmarks for which only a relatively small precondition was generated, we see
that tree refinement performs best on average. For our range benchmarks
(Fig. 10), we observed on average a slight benefit of the hybrid approach
for small preconditions. Note that even when ‘small’ preconditions are generated,
they nearly cover the entire input search area, i.e. our precondition inference is 
able to recover most of the original preconditions. 


Precondition Optimization Finally, we evaluate the effectiveness of the precondition
optimization on the subdivision and hybrid approach (we have not observed
the optimization to be particularly useful for tree refinement). For this evalua- 
tion, we fix a particular parameter setting that achieves a good trade-off between 
relative volume of inferred preconditions and running time of inference. Then 
we vary the number of target regions that the optimization should produce. On 
average, the preconditions generated for this experiment consisted of 120 distinct
regions before optimization. For each run, we compute the coverage of the
optimized precondition, i.e. the ratio of the volume of the optimized precondition
over the volume of the non-optimized inferred precondition. Fig. 8 visualizes the
results of this experiment as a cactus
plot where we sort the runs by coverage. For example, the value 0.27 for 1 re- 
gion at the 20th percentile means that in 80% of the runs, the coverage of the 
optimized precondition was at least 0.27. As expected, the more regions are al- 
lowed, the better the coverage of the optimized preconditions becomes. Overall, 
we see that our inference with optimization is able to generate relatively simple 
preconditions (i.e. with just a few regions) in reasonable time that nonetheless 
cover large parts of the verifiable area for many of the benchmarks. 


Case Study We demonstrate the benefits of our precondition inference on a practical
problem that inspired this work. We consider the 9-dimensional function
to calculate the scalar triple product $\alpha \cdot (\beta \times \gamma)$ of three 3-dimensional vectors
$\alpha, \beta, \gamma \in \mathbb{R}^3$, based on the requirements of an assumed use case: each parameter
will be within a range of $[-1337, 1337]$, and we require the error of the result to
be at most 3-10~°. This use case arose in a convex hull algorithm for scientific 
computing in material sciences. 

Running the recursive subdivision approach for this expression with a sub- 
division depth of 14 and 262144 samples yields the following results: In roughly 
13 minutes, the approach produces a precondition that covers about 67 percent 
of the search area and consists of 4608 individual intervals. In another 112 sec- 
onds, the optimization algorithm produces a precondition consisting of only two 
intervals which together cover 51% of the verified area and 34% of the search 
area. Using this optimized precondition, we can create a hybrid implementa- 
tion of the original function, which decides whether to use a (exact) rational or 
floating-point version dynamically. Even with the added overhead from checking 
the precondition, the required runtime reduces from 17.13s for a purely rational 
implementation to 10.77s for the hybrid implementation for running the function 
100000 times with random inputs from the input space. A similar speedup can 
be observed when using a higher precision floating-point implementation instead 
of an exact rational implementation in case the precondition does not hold. 
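
The structure of such a hybrid implementation can be sketched as follows. This is an illustration under stated assumptions, not the code used in the case study: the precondition boxes passed in are placeholders, Python's `fractions.Fraction` stands in for the exact rational arithmetic, and all function names are hypothetical.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Box = Sequence[Tuple[float, float]]  # one (lo, hi) pair per scalar input


def triple_product(a, b, c):
    """Scalar triple product a . (b x c); works for floats and Fractions alike."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def in_precondition(args: Sequence[float], boxes: Sequence[Box]) -> bool:
    """True if the 9 scalar inputs lie in at least one precondition region."""
    return any(all(lo <= x <= hi for x, (lo, hi) in zip(args, box))
               for box in boxes)


def triple_product_hybrid(a, b, c, boxes: Sequence[Box]) -> float:
    """Use floating point where the precondition holds, exact rationals otherwise."""
    if in_precondition([*a, *b, *c], boxes):
        return triple_product(a, b, c)                 # fast path
    exact = triple_product(*[[Fraction(x) for x in v] for v in (a, b, c)])
    return float(exact)                                # slow, exact path


# Placeholder precondition with a single region (not the inferred one).
boxes = [[(-1.0, 1.0)] * 9]
fast = triple_product_hybrid((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), boxes)
slow = triple_product_hybrid((2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), boxes)
```

With only a few regions, the membership test costs a handful of comparisons, which is why a precondition with few intervals is preferable for this use.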


6 Related Work 


The precondition synthesis approaches presented in this work rely on state-of- 
the-art floating-point verification and analysis tools to verify precondition can- 
didates and guarantee their soundness. While we have used the Daisy framework 
[10] as a verification backend, any tool able to calculate sound bounds for errors
or result ranges of floating-point functions could be used instead, for example Fluctuat [14],
Gappa [11], FPTaylor [29], Real2Float [21], or PRECISA [22].

We are not aware of an existing technique that can generate sound precon- 
ditions for floating-point functions. The closest related techniques are optimiza- 
tions that identify certain parts of the input domain, for which a rewriting of 
the input program results in a smaller roundoff error [26,32,30]. These rewrit- 
ings are based on real-valued identities, leveraging the fact that floating-point 
arithmetic is e.g. not associative, or polynomial approximations. The split of the 
input domain can be viewed as a kind of precondition, however, the goal and 
guarantees provided are very different. The aim is to identify and repair large 
roundoff errors, whereas our approach tries to identify the input domain with 
reasonable errors. Furthermore, all of the techniques rely on dynamic analysis 
and thus do not provide soundness guarantees. 

Dynamic analysis is frequently being used to estimate the magnitude of 
roundoff errors [3], and several works have developed a targeted search towards 
inputs that cause particularly large errors [31,6,33], in order to identify worst- 
case errors. Our precondition inference combines dynamic and static analysis in 
a novel way in that the dynamic analysis serves as a pre-processing step to explore
the input domain. As such, the goal of our dynamic analysis is different from ex- 
isting ones, as we want it to explore the input domain evenly, instead of focusing 
on a (possibly small) part of the input domain with large errors. 

One possible use of our inferred preconditions is to be able to generate im- 
plementations that choose an efficient floating-point precision whenever possi- 
ble, and otherwise use some ‘safe’ higher precision. In that, our approach is 
related to mixed-precision tuning techniques that mostly focus on implementa- 
tions that mix single, double and quad floating-point precision. Some of these 
use dynamic analysis to estimate errors and thus do not provide sound guar- 
antees [25,19,17,15], and others use static analysis with accuracy guarantees, 
but less scalability [5,9]. Mixed-precision tuning generally works well when the 
target error bounds are close to the error bounds of uniform-precision imple- 
mentations [9,27]. We consider mixed-precision tuning complementary to our 
precondition inference; for instance, preconditions generated by our approaches 
could be used as a starting-point for mixed-precision tuning. 


7 Conclusion 


We have presented the first precondition inference techniques from floating-point
accuracy and range postconditions, using a combination of dynamic and static
analysis. Each of the three approaches that we explored generates good results
from reasonably sized initial search areas with acceptable computational effort.
The approaches have different strengths and weaknesses; none is universally
better than the others. One of the main challenges for future work is to improve
the identification of preconditions when the initial search areas are very large.


Abstract. We present NeuReach, a tool that uses neural networks for
predicting reachable sets from executions of a dynamical system. Unlike 
existing reachability tools, NeuReach computes a reachability function 
that outputs an accurate over-approximation of the reachable set for any 
initial set in a parameterized family. Such reachability functions are use- 
ful for online monitoring, verification, and safe planning. NeuReach imple- 
ments empirical risk minimization for learning reachability functions. We 
discuss the design rationale behind the optimization problem and estab- 
lish that the computed output is probably approximately correct. Our ex- 
perimental evaluations over a variety of systems show promise. NeuReach 
can learn accurate reachability functions for complex nonlinear systems, 
including some that are beyond existing methods. From a learned reach- 
ability function, arbitrary reachtubes can be computed in milliseconds. 
NeuReach is available at https://github.com/sundw2014/NeuReach. 


Keywords: Reachability analysis · Data-driven methods · Machine learning


1 Introduction 


Reachability has traditionally been a fundamental building block for verification,
monitoring, and prediction, and it is finding an ever-expanding set of applications
in control of cyber-physical and autonomous systems [19,23]. Reachtubes cannot
be computed exactly for general hybrid models, but remarkable progress over the
past two decades has led to approximation algorithms for nonlinear and very
high-dimensional linear models (see, for example, [11,18,5,3,25,12,1,26,34]). All
of these algorithms and tools compute the reachtube from scratch every time
the algorithm is invoked for a new initial set $\mathcal{X}_0$, even if the system model does
not change. This is a missed opportunity for amortizing the cost of reachability
over multiple invocations. All the applications mentioned above, like verification,
monitoring, and prediction, indeed use multiple reachtubes of the same system,
but from different initial sets.


* The authors were supported by research grants from the National Security Agency’s 
Science of Security (SoS) program and National Science Foundation’s Formal Meth- 
ods in the Field (FMITF) program. 

In this paper, we present NeuReach, a tool that learns a reachability function
from executions of dynamical systems. With the learned reachability function,
for every new initial set a corresponding reachtube can be computed quickly. To
use NeuReach, the user has to implement a simulator function of the underlying
dynamical (or hybrid) system for generating trajectories, and several other
functions for sampling initial sets. As output, the tool will generate a function
which can be serialized and stored for repeated use. This function takes as input
a query consisting of an initial set $\mathcal{X}_0$ and a time instant $t$, and outputs an
ellipsoid, which is guaranteed to be an accurate over-approximation of the actual
reachable set.

Formally, NeuReach solves a probabilistic variant of the well-studied reachability
problem: the problem is to compute a reachability function $R(\cdot,\cdot)$ for
a given model (or simulator), such that for any initial set $\mathcal{X}_0$ and time $t$, the
output of the function $R(\mathcal{X}_0, t)$ is an over-approximation of the actual reachset
from $\mathcal{X}_0$ at time $t$. That is, $R$ is computed once and for all (possibly with an
expensive algorithm), and thereafter, for every new initial set $\mathcal{X}_0$ and time $t$, the
reachset over-approximation $R(\mathcal{X}_0, t)$ is computed simply by calling $R$. Thus, it
enables online and even real-time applications of reachset approximations.

NeuReach computes reachability functions using machine learning. We view 
this as a statistical learning problem where samples of the system’s trajectories 
have to be used to learn a parameterized reachability function $R_\theta(\cdot,\cdot)$. Because
the trajectory samples are the only requirements from the underlying dynamical 
system to run NeuReach, it can be applied to systems with or without analyt- 
ical models. In this paper, we discuss how the above problem can be cast as 
an optimization problem. This involves carefully designing a loss function that 
penalizes error and conservatism of the reachability function. With this loss func- 
tion, it becomes possible to solve the problem using empirical risk minimization 
and stochastic gradient descent. For the sake of justifying our design, we de- 
rive a theoretical guarantee on the sample complexity using standard statistical 
learning theory tools. 

We evaluate NeuReach on several benchmark systems and compare it with 
DryVR [21] which also uses machine learning for single-shot reachset computa- 
tions. Results show that, with the same training data, NeuReach generates more 
accurate and tighter reachsets. Using NeuReach we are able to check the key 
safety properties of the challenging F-16 benchmark presented in [28]. To our 
knowledge, this is the first successful verification of at least some scenarios in 
this benchmark. Furthermore, as expected, once $R(\cdot,\cdot)$ is computed, it can be
invoked to rapidly compute reachsets for arbitrary $\mathcal{X}_0$ and $t$. For example,
estimating a reachset for an 8-dimensional dynamical system with an NN-controller
only takes approximately 0.3 milliseconds. This makes NeuReach attractive for online and
real-time applications. 


Contributions. (1) We present a simple but effective and useful machine- 
learning algorithm for learning reachability functions from simulations. With 
the learned reachability function, accurate over-approximation of the reachable 
set for any initial set in a parameterized family can be quickly computed, which 
enables real-time safety checking and online planning; (2) We derive a probably
approximately correct (PAC) bound on the error of the learned reachability 
function (Theorem 1) using techniques in statistical learning theory; (3) We eval- 
uate the proposed tool on several benchmark dynamical systems and compare 
it with another data-driven reachability tool. Experiments show that NeuReach 
can learn more accurate and tighter reachability functions for complex nonlinear 
and hybrid systems, including some that are beyond existing methods. 


2 Related work 


Reachability analysis for models with known dynamics. This category
of approaches considers the reachability analysis of models with known dynamics
(i.e., white-box models). This is an active research area, and there is an extensive
body of theory and tools on this topic [11,2,15,5,25,33,27,16,38,10,39].
Reachability analysis is hard in general. Exact reachability is undecidable even
for deterministic linear and rectangular models [29,24]. For dynamical models
described with ordinary differential equations (ODE), Hamilton-Jacobi-Bellman
(HJB) equations can be used to derive the exact reachable sets [30,6,7]. An
HJB equation is a partial differential equation (PDE) whose solution
defines the reachability of the underlying dynamical system. However, solving
HJB equations is difficult, and such approaches do not scale to high-dimensional
systems. In practice, the exact reachable set might be unnecessary. For example,
over-approximations of the reachable sets could suffice for safety checking.
To this end, many approaches and tools have been developed. For example, 
Flow* [11] uses the technique of Taylor model integration to compute over- 
approximations of the solution of an ODE. 

Another line of work [22,20] leverages the sensitivity analysis of ODEs to
bound the discrepancy of solutions starting from a small initial set, and thus can
compute an over-approximation of the exact reachable set. In [12], a Lagrangian- 
based algorithm is proposed, which makes use of the Cauchy-Green stretching 
factor derived from an over-approximation of the gradient of the solution-flows 
of an ODE. All of the above approaches consider set-based reachability analysis. 


Data-driven reachability analysis. In cases where the exact dynamics
of the system are unknown or only partially known, the above approaches cannot be
applied. One straightforward direction is to learn the reachability from behaviors
[42] of the dynamical system. Several approaches have been proposed for
reachability using only simulations of the underlying system. These approaches
include scenario optimization [14,44], sensitivity analysis [21], Gaussian processes
[13], and adversarial sampling [32,9].

NeuReach falls in the category of approaches that use randomized algorithms 
for reachability analysis of deterministic (and not stochastic) systems. Another 
member in this category is the scenario optimization approach presented in [14]. 
Different from NeuReach, this method learns a single reachset for a fixed initial 
set and time interval instead of a mapping from arbitrary initial sets and time 
to the reachsets. Another approach based on scenario optimization is presented
in [44]. This method computes a fixed-width reachtube by learning a function 
of time to represent the central axis of the reachtube. Moreover, it uses polyno- 
mials with handcrafted feature vectors for learning, which requires case-by-case 
design and fine-tuning. In contrast, our method learns a more flexible reacha- 
bility function using neural networks and avoids the use of handcrafted feature 
vectors. DryVR [21] computes the reachtubes based on sensitivity analysis. It 
first learns a sensitivity function with theoretical guarantees, and then uses it to 
compute the reachset. Among all these tools or methods, we found that DryVR
is the only one that has a publicly available implementation. Thus, we compared 
NeuReach with DryVR. 


Neural networks for reachability analysis. Applications of machine learning
with neural networks for reachability and monitoring have become an active
research area. The approach in [23] aims to learn the reachtube from data using
neural networks, with a focus on motion planning. Unlike NeuReach, this approach
learns the dynamics of the reachtube, and the reachtube can be obtained
by integrating that dynamics. In [30,7], neural networks are used as a PDE solver 
to approximate the solution of HJB equations. The approach in [36] makes use 
of neural networks to approximate the reachability of dynamical systems with 
control input. In [38,10], the authors develop a framework for runtime predictive 
monitoring of hybrid automata using neural networks and conformal prediction. 


3 Problem setup and an overview of the tool 


NeuReach works with deterministic dynamical systems. The state of the system
is denoted by $x \in \mathcal{X} \subseteq \mathbb{R}^n$. We assume that we have access to a simulator
function $\xi : \mathcal{X} \times \mathbb{R}_{\geq 0} \mapsto \mathcal{X}$ that generates trajectories of the system up to a time
bound $T$. That is, given an initial state $x_0 \in \mathcal{X}$ and a time instant $t \in [0, T]$,
$\xi(x_0, t)$ is the state at time $t$.¹

Consider the evolution of the system from a set of initial states (initial set)
$\mathcal{X}_0 \subseteq \mathcal{X}$. Lifting the notation of $\xi$ to sets, we write the reachset from $\mathcal{X}_0$ as
$\xi(\mathcal{X}_0, t) := \bigcup_{x_0 \in \mathcal{X}_0} \xi(x_0, t)$. In general, $\xi(\mathcal{X}_0, t)$ cannot be computed precisely, and
thus, we resort to over-approximations of $\xi(\mathcal{X}_0, t)$ which are usually sufficient for
verification and monitoring of safety and general temporal logic requirements,
and also for planning. Beyond computing over-approximations of $\xi(\mathcal{X}_0, t)$ for
a single $\mathcal{X}_0$ and $t$, we are interested in finding a reachability function
$R : 2^{\mathcal{X}} \times [0, T] \mapsto 2^{\mathcal{X}}$ such that, ideally, $\xi(\mathcal{X}_0, t) \subseteq R(\mathcal{X}_0, t)$ for all valid $\mathcal{X}_0$ and $t$. NeuReach
implements a solution to this problem which provides a probabilistic version of
the above guarantee with some restrictions on the shape of the initial set $\mathcal{X}_0$.

In order to discuss the error of a reachability function $R$, we have to assume
that its arguments $\mathcal{X}_0$ and $t$ are independently chosen according to some
distributions $P_1$ and $P_2$, i.e., $\mathcal{X}_0 \sim P_1$ and $t \sim P_2$. Also, we need a distribution
function $D(\cdot)$ such that $D(\mathcal{X}_0)$ is a distribution over $\mathcal{X}_0$. For example, $D(\mathcal{X}_0)$ could
be the uniform distribution over $\mathcal{X}_0$. Given these distributions, the error of a
reachability function is defined as:

$$\Pr_{\mathcal{X}_0 \sim P_1,\; t \sim P_2,\; x_0 \sim D(\mathcal{X}_0)} \left[ \xi(x_0, t) \notin R(\mathcal{X}_0, t) \right]. \tag{1}$$

Here, we assume that the joint distribution of $(\mathcal{X}_0, t, x_0)$ is defined on the
Borel $\sigma$-algebra such that any Borel set is measurable. Given the fact that $R$ is
continuous² and $\xi$ as the trajectory of a dynamical system is at least piece-wise
continuous, the set of all tuples $(\mathcal{X}_0, t, x_0)$ that satisfy $\xi(x_0, t) \in R(\mathcal{X}_0, t)$ must
be a Borel set, and thus is measurable. Therefore, the above probability is well
defined.


1 For the sake of simplicity, here we ignore issues arising from quantization and
numerical errors in simulators. Such issues have been extensively studied in numerical
analysis and we refer the reader to [17] for a discussion related to verification.


User interface and data representation. $P_1$, $P_2$ and $D$ are specified by the
user as functions generating samples (explained below). The input and output of
the reachability function $R(\mathcal{X}_0,t)$ involve infinite objects, and in order to learn
$R$, first, we need some finite representations of these objects. In NeuReach, $\mathcal{X}_0$ is
picked from a user-specified family of sets where each set can be represented by
a finite number of parameters. For example, $\mathcal{X}_0$ could be a ball and represented
by two parameters: center and radius. From here on, we will not distinguish
between $\mathcal{X}_0$ and its parameterized representation. Similarly, the reachset $R(\mathcal{X}_0, t)$
also needs a representation. NeuReach represents the reachsets with ellipsoids.
Given a vector $x_0 \in \mathbb{R}^n$ and a matrix $C \in \mathbb{R}^{n \times n}$, the set $E(x_0, C) := \{x \in \mathbb{R}^n :
\|C \cdot (x - x_0)\|_2 \leq 1\}$ is an ellipsoid. Thus, given the center, an ellipsoid can be
represented by an $n \times n$ matrix.
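
The ellipsoid representation can be illustrated with a small membership test. This is only a sketch in plain Python: the function name `in_ellipsoid` and the example matrix are illustrative assumptions, not part of NeuReach.

```python
import math

def in_ellipsoid(x, center, C):
    """Return True if x lies in E(center, C) = {x : ||C (x - center)||_2 <= 1}."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    # Matrix-vector product C @ diff, written out for plain Python lists.
    y = [sum(c_ij * d_j for c_ij, d_j in zip(row, diff)) for row in C]
    return math.sqrt(sum(v * v for v in y)) <= 1.0

# C = diag(1/2, 1) describes an axis-aligned ellipse with semi-axes 2 and 1.
C = [[0.5, 0.0], [0.0, 1.0]]
center = [1.0, 1.0]
print(in_ellipsoid([2.5, 1.0], center, C))  # offset 1.5 along the long axis
print(in_ellipsoid([1.0, 2.5], center, C))  # offset 1.5 along the short axis
```

A matrix with smaller entries stretches the set, which is why the shape of the reachset can be learned as a matrix-valued function.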

In order to use NeuReach, the user has to implement the following functions. 


(i) sample_X0(): Produces a random initial set $\mathcal{X}_0$ from a distribution $P_1$.
Specifically, the parameterized representation of $\mathcal{X}_0$ is returned.

(ii) sample_t(): Produces a random sample of $t$ from a distribution $P_2$.

(iii) sample_x0(X0): Takes an initial set $\mathcal{X}_0$, and produces a random sample of
$x_0 \in \mathcal{X}_0$ according to a distribution $D(\mathcal{X}_0)$.

(iv) simulate(x0): Takes an initial state $x_0$ and generates a finite trajectory
$\xi(x_0, \cdot)$ which is a sequence of states at some time instants. The user should
make sure that for every time instant returned by sample_t(), a state
corresponding to it can be found in the simulated trajectory.

(v) get_init_center(X0): Takes an initial set $\mathcal{X}_0$ and returns $\mathbb{E}[D(\mathcal{X}_0)] :=
\mathbb{E}_{x \sim D(\mathcal{X}_0)}[x]$, which is the mean value of $x$ over the initial states.
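
As an illustration of this interface, the five functions might look as follows for a toy 2-dimensional system. Everything beyond the function names is an assumption made for this sketch: the dynamics dx/dt = -x, the constants `T` and `DT`, the tuple encoding of a ball, and the trajectory format are hypothetical and may differ from what the tool expects.

```python
import math
import random

T, DT = 4.0, 0.05            # assumed time bound and simulator step
N_STEPS = round(T / DT)

def sample_X0():
    """Draw a ball from P1, returned as its parameters (cx, cy, r)."""
    return (random.uniform(1.0, 2.0), random.uniform(2.0, 3.0),
            random.uniform(0.0, 0.5))

def sample_t():
    """Draw t from P2: uniform over the simulator's time grid."""
    return random.randint(1, N_STEPS) * DT

def sample_x0(X0):
    """Draw x0 from D(X0): here, uniform on the boundary of the ball."""
    cx, cy, r = X0
    angle = random.uniform(0.0, 2.0 * math.pi)
    return (cx + r * math.cos(angle), cy + r * math.sin(angle))

def simulate(x0):
    """Toy dynamics dx/dt = -x, so xi(x0, t) = x0 * exp(-t) on the grid.

    Returns a list of (t, x, y) tuples, one per grid point.
    """
    return [(k * DT, x0[0] * math.exp(-k * DT), x0[1] * math.exp(-k * DT))
            for k in range(N_STEPS + 1)]

def get_init_center(X0):
    """Mean of D(X0); for a ball with a symmetric D this is the center."""
    return (X0[0], X0[1])
```

Because sample_t() only returns multiples of `DT`, every sampled time instant has a matching entry in the trajectory returned by simulate(x0), which is the consistency requirement stated in item (iv).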


Given these functions, NeuReach computes a reachability function $R$ with an
error guarantee (Theorem 1). The reachset $R(\mathcal{X}_0,t)$ is an ellipsoid centered at
$\xi(\mathbb{E}[D(\mathcal{X}_0)], t)$. As the output, NeuReach will generate a Python function R(X0,
t). This function can be serialized and stored on disk for future use. When
calling this function, the user provides the initial set $\mathcal{X}_0$ and $t$, and then an $n \times n$
matrix representing the shape of the ellipsoid will be returned.


2 As will be stated later, $R$ is a neural network, which is indeed continuous.

4 Design of NeuReach: Learning reachability functions


We present the design rationale behind NeuReach and discuss the learning algo- 
rithm it implements. We show that standard results in statistical learning theory 
give a probabilistic guarantee on the error of the learned reachability function. 


4.1 Reachability with Empirical Risk Minimization 


The basic idea is to model the reachset $R(\mathcal{X}_0,t)$ as an ellipsoid around
$\xi(\mathbb{E}[D(\mathcal{X}_0)], t)$. As stated earlier, given the center, an $n$-dimensional ellipsoid
can be represented by an $n \times n$ matrix. Thus, learning the set-valued
reachability function $R(\mathcal{X}_0,t)$ becomes the problem of learning a matrix-valued
function $C(\mathcal{X}_0,t)$ that describes the shape of the set. We represent function $C$
using parametric models, such as neural networks. Let us denote this parametric,
matrix-valued function by $C_\theta$, where $\theta \in W \subseteq \mathbb{R}^p$ is the vector of parameters.
The parameter $\theta$ could be, for example, a scalar representing a coefficient of a
polynomial, a vector representing weights of a neural network, etc. Thus, the
parametric reachability function is:

$$R_\theta(\mathcal{X}_0, t) := E\big(\xi(\mathbb{E}[D(\mathcal{X}_0)], t),\ C_\theta(\mathcal{X}_0, t)\big). \qquad (2)$$


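To simplify the notation, for $X = (\mathcal{X}_0, t, x_0)$ and parameter $\theta$, we define
a function $g_\theta(X) := \|C_\theta(\mathcal{X}_0,t) \cdot (\xi(x_0, t) - \xi(\mathbb{E}[D(\mathcal{X}_0)], t))\|_2$. For a particular
sample $X$ and a parameter $\theta$, if $g_\theta(X) \leq 1$, then $\xi(x_0,t) \in R_\theta(\mathcal{X}_0, t)$; otherwise
it is outside and contributes to the error. The goal of our learning algorithm is
to find a $\theta$ to minimize the error of the resulting reachability function $R_\theta$, which
gives the following optimization problem:

$$\begin{aligned}
\theta^* &= \arg\min_\theta \Pr_{\mathcal{X}_0 \sim P_1,\ t \sim P_2,\ x_0 \sim D(\mathcal{X}_0)} \left[ \xi(x_0, t) \notin R_\theta(\mathcal{X}_0, t) \right] \\
&= \arg\min_\theta \mathbb{E}_{\mathcal{X}_0 \sim P_1,\ t \sim P_2,\ x_0 \sim D(\mathcal{X}_0)} \left[ I\left( \|C_\theta(\mathcal{X}_0,t) \cdot (\xi(x_0,t) - \xi(\mathbb{E}[D(\mathcal{X}_0)], t))\|_2 > 1 \right) \right] \\
&= \arg\min_\theta \mathbb{E}_{X := (\mathcal{X}_0, t, x_0)} \left[ I\left( g_\theta(X) - 1 > 0 \right) \right],
\end{aligned}$$

where $I(\cdot)$ is the indicator function.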

In order to solve the above optimization problem using empirical risk
minimization, we consider the following setup. First, a training set is constructed.
We denote a training set with $N$ samples by $S = \{X_i\}_{i=1}^N$, where the samples
$X_i = (\mathcal{X}_0^{(i)}, t^{(i)}, x_0^{(i)})$ are independently drawn from the data distribution defined
by $\mathcal{X}_0 \sim P_1$, $t \sim P_2$, $x_0 \sim D(\mathcal{X}_0)$. The empirical loss on $S$ for a parameter $\theta$ is

$$L_{ERM}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(g_\theta(X_i) - 1), \qquad (3)$$

where $\ell(x) := \max\{0, \frac{x}{\alpha} + 1\}$ is the hinge loss function with the hyper-parameter
$\alpha > 0$, which is a soft proxy for the indicator function. Therefore, the empirical
loss $L_{ERM}$ is a soft, empirical proxy of the actual error as defined in Equation (1).
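
The relation between the hinge loss and the 0/1 error can be checked numerically. The sketch below assumes the values g_theta(X_i) have already been computed; the function names are illustrative and the list of values is made up.

```python
def hinge(x, alpha=0.001):
    """Hinge loss l(x) = max(0, x / alpha + 1), a soft proxy for I(x > 0)."""
    return max(0.0, x / alpha + 1.0)

def empirical_loss(g_values, alpha=0.001):
    """L_ERM = (1/N) * sum_i l(g_theta(X_i) - 1)."""
    return sum(hinge(g - 1.0, alpha) for g in g_values) / len(g_values)

def empirical_error(g_values):
    """Fraction of samples outside the ellipsoid, i.e. with g_theta(X_i) > 1."""
    return sum(1 for g in g_values if g > 1.0) / len(g_values)

g_values = [0.2, 0.9, 0.9995, 1.1]   # hypothetical values of g_theta(X_i)
print(empirical_error(g_values), empirical_loss(g_values))
```

Since l(x) >= 1 whenever x > 0, the hinge-based loss is never smaller than the empirical error. A small alpha only penalizes samples that lie outside the ellipsoid or within a thin margin inside its boundary.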
Argument   Default value   Description

system     -               Name of the Python file containing the model.
lambda     0.03            $\lambda$ in Eq. (4).
alpha      0.001           $\alpha$ in Eq. (3).
N_X0       100             $N_{\mathcal{X}_0}$: Number of initial sets.
N_x0       10              $N_{x_0}$: Number of initial states.
N_t        100             $N_t$: Number of time instants.
layer1     64              $L_1$: Number of neurons in the first layer of the NN.
layer2     64              $L_2$: Number of neurons in the second layer of the NN.
epochs     30              Number of epochs for training.
lr         0.01            Learning rate.


Table 1: Command-line arguments passed to the tool.


In addition to minimizing the empirical loss, we would also like the
over-approximation of the reachset to be as tight as possible. Thus, the volume of
the ellipsoid should be penalized. Inspired by [14], we use $-\log(\det(C^\top C))$ as
a proxy of the volume of an ellipsoid $E(x_0, C)$, and the following regularization
term is added to penalize large ellipsoids.

$$L_{REG}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log\left( \det\left( C_\theta(\mathcal{X}_0^{(i)}, t^{(i)})^\top C_\theta(\mathcal{X}_0^{(i)}, t^{(i)}) \right) \right).$$

Combining the two terms, we define the overall optimization problem:

$$\theta^* = \arg\min_\theta\ L_{ERM}(\theta) + \lambda L_{REG}(\theta), \qquad (4)$$

where $\lambda$ is a hyper-parameter balancing two loss terms.
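
The volume proxy and the combined objective can be sketched in a few lines. This is an illustration under assumptions, not the tool's code: it uses plain Python lists instead of tensors, takes the regularizer to be the sample average of the proxy, and the helper names `det`, `volume_proxy` and `total_loss` are made up.

```python
import math

def det(M):
    """Determinant by Laplace expansion (fine for the small n used here)."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j, a in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def volume_proxy(C):
    """-log det(C^T C); grows as the ellipsoid E(x0, C) gets larger."""
    n = len(C)
    CtC = [[sum(C[k][i] * C[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
    return -math.log(det(CtC))

def total_loss(erm_loss, matrices, lam=0.03):
    """L_ERM + lambda * L_REG, with L_REG averaged over the given matrices."""
    reg = sum(volume_proxy(C) for C in matrices) / len(matrices)
    return erm_loss + lam * reg

unit_disc = [[1.0, 0.0], [0.0, 1.0]]     # radius 1
big_disc = [[0.5, 0.0], [0.0, 0.5]]      # radius 2
print(volume_proxy(unit_disc), volume_proxy(big_disc))
```

The proxy is monotone in the true volume because the volume of E(x0, C) is proportional to 1/|det C|, so its logarithm equals -0.5 * log det(C^T C) up to a constant.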


Machine learning setup. The training set is constructed as follows. First,
we sample $N_{\mathcal{X}_0}$ initial sets by calling sample_X0(). Then, for each initial set,
we sample $N_{x_0}$ initial states from it using sample_x0(X0) and then get $N_{x_0}$
trajectories by calling simulate(x0). Finally, for each trajectory, we sample $N_t$
time instants by calling sample_t(). Thus, the resulting training set contains
$N := N_{\mathcal{X}_0} \times N_{x_0} \times N_t$ samples, but generating such a training set only needs $N_{\mathcal{X}_0} \times
N_{x_0}$ trajectory simulations. NeuReach implements the optimization problem of
Equation (4) in PyTorch [37] and solves it with stochastic gradient descent. By
default, a three-layer neural network is used to represent $C_\theta$. For $n$-dimensional
reachsets, the number of neurons in each layer are $L_1$, $L_2$, and $n^2$, where $L_1$ and
$L_2$ can be specified by the user. The output vector of the neural network is then
reshaped to be an $n \times n$ matrix. By default, we set $\alpha = 0.001$ and $\lambda = 0.03$.
The neural network is trained for 30 epochs with a learning rate of 0.01.
Hyper-parameters including learning rate, $\alpha$, $\lambda$, and size of the training set can be easily
changed via the user interface as shown in Table 1.
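
The nested sampling described above can be sketched as follows. The function `build_training_set` and the 1-dimensional stand-in callables are hypothetical; the point of the sketch is only to show that each trajectory is simulated once and reused for all of its sampled time instants.

```python
import random

def build_training_set(sample_X0, sample_x0, sample_t, simulate,
                       n_X0=100, n_x0=10, n_t=100):
    """Return samples (X0, t, x0, state_at_t) and the number of simulations."""
    samples, n_sim = [], 0
    for _ in range(n_X0):
        X0 = sample_X0()
        for _ in range(n_x0):
            x0 = sample_x0(X0)
            traj = simulate(x0)          # one simulation per initial state
            n_sim += 1
            for _ in range(n_t):
                t = sample_t()
                samples.append((X0, t, x0, traj[t]))  # reuse the trajectory
    return samples, n_sim

# Hypothetical 1-D stand-ins: X0 = (center, radius), time indexed by step.
STEPS = 20
demo = build_training_set(
    sample_X0=lambda: (random.uniform(1, 2), random.uniform(0, 0.5)),
    sample_x0=lambda X0: X0[0] + random.choice([-1, 1]) * X0[1],
    sample_t=lambda: random.randint(1, STEPS),
    simulate=lambda x0: {k: x0 * 0.9 ** k for k in range(STEPS + 1)},
    n_X0=3, n_x0=4, n_t=5)
print(len(demo[0]), demo[1])   # 60 samples from 12 simulations
```

With the defaults of Table 1 the same loop yields 100,000 samples from 1,000 simulations.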


NeuReach: Learning Reachability Functions from Simulations 329 


4.2 Probabilistic Correctness of NeuReach 


The following theorem shows that the error of the learned reachability
function $R_\theta$ can be bounded. Specifically, the difference between the error and the
empirical loss is $O(\sqrt{1/N})$, where $N$ is the size of the training set.


Theorem 1. For any $\epsilon > 0$, and a random training set $S$ with $N$ i.i.d. samples,
with probability at least $1 - 2\exp(-2N\epsilon^2)$, the following inequality holds,

$$\mathbb{E}_X\left[ I(g_\theta(X) - 1 > 0) \right] \leq \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1) + \frac{12 L_g}{\alpha} \sqrt{\frac{p}{N}} + \epsilon, \qquad (5)$$

where $p$ is the number of parameters, i.e. $\theta \in \mathbb{R}^p$, $\bar{\ell}(\cdot) = \min\{1, \ell(\cdot)\}$ is the
truncated hinge loss, and $L_g$ is the Lipschitz constant of $g_\theta$ w.r.t. $\theta$.


Theorem 1 shows that by controlling $\epsilon$ and $N$, the actual error
$\mathbb{E}_X[I(g_\theta(X) - 1 > 0)]$ can be made arbitrarily close to the empirical loss
$\frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1)$, with arbitrarily high probability. The empirical loss on the
training set $S$ can be made very small in practice due to the high capacity of the
neural network. Of course, there is no free lunch, in general. In order to drive
the empirical loss to 0, we might have to increase the number of parameters,
which in turn increases the term $\frac{12 L_g}{\alpha} \sqrt{\frac{p}{N}}$. Furthermore, the hyper-parameter
$\lambda$ also affects the empirical loss. A smaller $\lambda$ results in lower empirical loss but
more conservative reachsets. Actually, conservatism and accuracy are conflicting
requirements. As shown in [21], when using reachability to verify safety, accuracy
determines the soundness of the verification, while conservatism influences the
sample efficiency. We wanted to focus more on soundness than on efficiency.
Thus, a theoretical guarantee is derived for accuracy but not for conservatism.
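
To get a feel for the probabilistic part of the guarantee, the confidence level 1 - 2 exp(-2 N eps^2) can be inverted for eps. The sketch below only evaluates that expression; the function names and the target delta = 0.01 are assumptions for illustration.

```python
import math

def confidence(n_samples, eps):
    """Probability lower bound 1 - 2 exp(-2 N eps^2) from Theorem 1."""
    return 1.0 - 2.0 * math.exp(-2.0 * n_samples * eps ** 2)

def eps_for_confidence(n_samples, delta):
    """Smallest eps such that 2 exp(-2 N eps^2) <= delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))

N = 100 * 10 * 100          # default N_X0 * N_x0 * N_t from Table 1
eps = eps_for_confidence(N, delta=0.01)
print(f"eps = {eps:.4f}, confidence = {confidence(N, eps):.4f}")
```

The slack eps shrinks like the square root of 1/N, so quadrupling the training set halves it for a fixed confidence level.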


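Proof. Starting from the left hand side and using the definition of hinge loss,
we get $\mathbb{E}_X[I(g_\theta(X) - 1 > 0)] \leq \mathbb{E}_X[\bar{\ell}(g_\theta(X) - 1)]$. By adding and subtracting
the empirical loss term, we get:

$$\begin{aligned}
\mathbb{E}_X\left[\bar{\ell}(g_\theta(X) - 1)\right]
&= \mathbb{E}_X\left[\bar{\ell}(g_\theta(X) - 1)\right] - \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1) + \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1) \\
&\leq \sup_{\theta' \in W} \left( \mathbb{E}_X\left[\bar{\ell}(g_{\theta'}(X) - 1)\right] - \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_{\theta'}(X_i) - 1) \right) + \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1),
\end{aligned}$$

where the inequality follows from the definition of supremum.


Let $V = \sup_{\theta \in W} \left( \mathbb{E}_X\left[\bar{\ell}(g_\theta(X) - 1)\right] - \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1) \right)$, i.e. the worst-case
difference between the empirical average and the expectation of the loss.
Note that $V$ is a random quantity since $S = \{X_i\}_{i=1}^N$ is random. Next, we derive
an upper bound on $V$ that holds with high probability.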

First, we derive an upper bound on $\mathbb{E}_S[V]$. Let $\mathcal{G}$ be the function class
containing $g_\theta$ parameterized by $\theta$, i.e. $\mathcal{G} := \{g_\theta(\cdot) \mid \theta \in W\}$. Similarly, $\mathcal{F} := \{\bar{\ell}(g_\theta(\cdot) -
1) \mid \theta \in W\}$. Applying $\mathcal{G}$ to the set of inputs $S$ generates a new set $\mathcal{G}(S) :=
\{(g(X_1), g(X_2), \cdots, g(X_N)) : g \in \mathcal{G}\}$. Define $\mathcal{F}(S) = \{(f(X_1), \cdots, f(X_N)) :
f \in \mathcal{F}\}$ in the same way.

Notice that $V$ is the worst-case (among all $f_\theta \in \mathcal{F}$) gap between the
expectation and the empirical average of $f_\theta(X)$. A fundamental result in PAC
learning (Theorem 3.3 in [35]) shows that this gap can be bounded as $\mathbb{E}_S[V] \leq
2\mathbb{E}_S[\mathrm{Rad}(\mathcal{F}(S))]$, where $\mathrm{Rad}(\mathcal{F}(S))$ is the Rademacher complexity [35] of $\mathcal{F}(S)$.
Furthermore, notice that $\mathcal{F}(S)$ can be generated from $\mathcal{G}(S)$ by shifting it and
composing it with $\bar{\ell}$. It follows from Talagrand's contraction lemma [31] that
$2\mathbb{E}_S[\mathrm{Rad}(\mathcal{F}(S))] \leq 2 L_\ell \mathbb{E}_S[\mathrm{Rad}(\mathcal{G}(S))]$, where $L_\ell = \frac{1}{\alpha}$ is the Lipschitz constant of $\bar{\ell}$.

Finally, following from a conclusion on Rademacher complexity of Lipschitz
parameterized function classes (see page 13 in [8]), we get $\mathbb{E}_S[\mathrm{Rad}(\mathcal{G}(S))] \leq
6 L_g \sqrt{\frac{p}{N}}$. Therefore, we get

$$\mathbb{E}_S[V] \leq \frac{12}{\alpha} L_g \sqrt{\frac{p}{N}}. \qquad (6)$$


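Then, applying McDiarmid's inequality [35] gives a high-probability bound
on $V$. That is,

$$\Pr\left( \left| V - \mathbb{E}_S[V] \right| > \epsilon \right) \leq 2\exp(-2N\epsilon^2).$$

Together with Eq. (6), we have $V \leq \mathbb{E}_S[V] + \epsilon \leq \frac{12}{\alpha} L_g \sqrt{\frac{p}{N}} + \epsilon$ with probability
at least $1 - 2\exp(-2N\epsilon^2)$. This implies

$$\mathbb{E}_X\left[ I(g_\theta(X) - 1 > 0) \right] \leq \frac{1}{N} \sum_{i=1}^{N} \bar{\ell}(g_\theta(X_i) - 1) + \frac{12}{\alpha} L_g \sqrt{\frac{p}{N}} + \epsilon$$

with probability at least $1 - 2\exp(-2N\epsilon^2)$, which completes the proof.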


5 Experimental evaluation 


We evaluated NeuReach on several benchmark systems including the Van der Pol 
oscillator, the Moore-Greitzer model of a jet engine, an 8-dimensional quadrotor 
controlled by a neural network [40], and an F-16 Ground Collision Avoidance 
system [28]. We also compare our method with DryVR [21]. Since NeuReach is 
fully data-driven and does not rely on the analytical model of the system, it 
would not make sense to compare against model-based methods like Hamilton- 
Jacobi reachability analysis [6], Flow* [11], C2E2 [18], or SReach [41]. Some of 
our benchmarks cannot be handled by these tools. Also, once the reachability 
function is learned, many reachsets can be computed very quickly by our method. 
Given that other tools need to compute the reachset from scratch for each new
query, comparisons based on running times would not make sense either.

5.1 Benchmark systems


The simulators available for the benchmark systems allow us to specify fixed
time-steps $\Delta t$ and a time bound $T$. As for the distribution $P_2$, we adopt the
uniform distribution, i.e. $P_2 = \mathrm{Unif}(\{\Delta t, 2\Delta t, \cdots, \lfloor T/\Delta t \rfloor \Delta t\})$ (recall the definition
of this distribution in Section 3). For a given initial set $\mathcal{X}_0$, $D(\mathcal{X}_0)$ is defined as
the uniform distribution on the boundary of $\mathcal{X}_0$. As shown in Corollary 1 of [43],
the boundary of the reachable set of an initial set is equal to the reachable set
of the initial set's boundary for ODEs. That is, if the estimated reachable set
contains the reachable set of the initial set's boundary, it automatically contains
that of the interior. Thus, we only sample points on the boundary of $\mathcal{X}_0$ to
improve sample efficiency. As for the distribution $P_1$, we will give details for each
benchmark below.


Van der Pol oscillator is a widely used 2-dimensional nonlinear model. An
initial set $\mathcal{X}_0$ is a ball centered at $c$ with radius $r$. The distribution $P_1$ for
choosing $\mathcal{X}_0$ is specified by the distributions for choosing these parameters. In our
experiments, we use $c \sim \mathrm{Unif}([1,2] \times [2,3])$ and $r \sim \mathrm{Unif}([0, 0.5])$. The time
bound is set to $T = 4$, and time step is $\Delta t = 0.05$.


JetEngine model from [4] is also 2-dimensional and commonly used as a
verification benchmark. Again, we use balls for the initial sets with $c \sim \mathrm{Unif}([0.3, 1.3] \times
[0.3, 1.3])$ and $r \sim \mathrm{Unif}([0, 0.5])$. The time bound is set to $T = 10$, and time step
is $\Delta t = 0.05$.


F-16 Ground Collision Avoidance System [28] is a challenging benchmark
for formal analysis tools. This system consists of 16 state variables (see Table 1
in [28]) among which Vt and alt are air speed and altitude. The key safety
property of interest is ground collision avoidance, and therefore, in our
experiments we focus on estimating the reachset only for Vt and alt. We consider
initial uncertainty in up to 6 state variables, [Vt, a, ġ, Y, Q, alt]. The function
simulate(x0) is designed to return projections of trajectories to Vt and alt,
while sample_X0() returns 6-dimensional initial sets. We restrict the initial set
to be hyper-rectangles as in [28]. An initial set $\mathcal{X}_0$ is determined by a center
$c \in \mathbb{R}^6$ and a radius $r \in \mathbb{R}^6$ with $\mathcal{X}_0 = \{x \in \mathbb{R}^6 : c - r \leq x \leq c + r\}$. As for
the distribution, we choose c ∼ Unif([560, 600] × [−0.1, 0.1] × [0, 4] × [−4, 4] ×
[−0.1, 0.1] × [70, 80]) and r ∼ Unif([0, 10] × [0, 0.1] × [0, 75] × [0, §] × [0, 0.1] × [0, 1]).
The time bound is set to T = 20, and time step is Δt = 35°. However, DryVR
does not support hyper-rectangles as initial sets. Thus, we also use another
setting for comparison where the initial sets are balls. To do this, we sample balls
from a cube with $c \sim \mathrm{Unif}([-1, 1] \times \cdots \times [-1, 1])$ and $r \sim \mathrm{Unif}([0, 0.5])$. Then, we
transform this ball to the original coordinate system by scaling each dimension.
This setting is shown in Fig. 2 (Left) as F-16 (Spherical).
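
One possible reading of this spherical setting is sketched below: a point is drawn on the boundary of a ball in normalized coordinates and then mapped to physical ranges dimension by dimension. The bounds `lo` and `hi`, the function names, and the affine map from [-1, 1] are assumptions for illustration; the source does not spell out the exact transformation.

```python
import math
import random

def sample_unit_sphere(n):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    v = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sample_scaled_boundary_point(c, r, lo, hi):
    """Sample on the boundary of the ball B(c, r) in normalized coordinates,
    then map each dimension affinely from [-1, 1] to [lo[i], hi[i]]."""
    d = sample_unit_sphere(len(c))
    z = [ci + r * di for ci, di in zip(c, d)]
    return [lo_i + (zi + 1.0) * (hi_i - lo_i) / 2.0
            for zi, lo_i, hi_i in zip(z, lo, hi)]

# Hypothetical physical ranges for two state variables.
point = sample_scaled_boundary_point([0.0, 0.0], 0.5, [0.0, 10.0], [2.0, 30.0])
print(point)
```

After scaling, the ball becomes an axis-aligned ellipsoid in the original coordinates, which is why a single radius in the normalized cube can stand for different uncertainties per state variable.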


Quadrotor controlled by a neural controller is based on [40]. The state of
the quadrotor system is $x = [p_x, p_y, p_z, v_x, v_y, v_z, \theta_x, \theta_y]$, and the control input


332 D. Sun and S. Mitra 


Fig. 1: Left: Some reachsets of JetEngine. Red curve is $\xi(\mathbb{E}[D(\mathcal{X}_0)], \cdot)$. We randomly
sample 100 trajectories starting from $\mathcal{X}_0$. Points on sampled trajectories are shown
as black dots. Boundaries of the estimated reachsets at some selected time instants
are shown. Clearly, ellipsoids can approximate the actual reachsets better; Right: A
sample reachtube of F-16, plotted against t (s). Green region is the reachtube estimated
by NeuReach, which is the union of all reachsets. Blue curves are sampled trajectories
from the initial set. The blue region can be viewed as the actual reachtube. The
estimated reachtube verifies the safety, i.e. alt > 0 always holds.


is $u := [a_z, \omega_x, \omega_y]$. We are only interested in estimating the reachability of the
position variables, i.e., the first 3 dimensions of the state vector. We use balls
for the initial sets with c ∼ Unif([−1, 1] × ⋯ × [−1, 1]) and r ∼ Unif([0, V8)]).
The time bound is set to $T = 10$, and time step is $\Delta t = 0.05$.


5.2 Experimental results 


Evaluation metrics. In order to evaluate the learned reachability function,
we randomly sample 10 initial sets for testing. For each initial set $\mathcal{X}_0$, we then
sample 100 trajectories starting from it. For every sampled time instant on the
sampled trajectories, we check whether the state is contained in the estimated
reachset and compute the empirical error (i.e., the frequency with which a sample is
not in the estimated reachset). In order to evaluate the conservatism of the
over-approximations, we also compare the size of the over-approximations. For
each initial set $\mathcal{X}_0$, we compute the total volume of the over-approximations
$R(\mathcal{X}_0, t_i)$ where $t_i = \Delta t, 2\Delta t, \cdots, \lfloor T/\Delta t \rfloor \Delta t$. Then, the total volume averaged over
the 10 sampled initial sets is reported. Results are summarized in Figure 2 (Left).
Please note that we use the default settings in Table 1 for all benchmarks.
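
Both metrics are simple to compute once the learned matrices are available. The sketch below is an assumption-laden illustration: it takes |det C| and the containment flags as given, uses the standard formula for the volume of an n-dimensional ball, and the function names are made up.

```python
import math

def unit_ball_volume(n):
    """Volume of the n-dimensional unit ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2.0) / math.gamma(n / 2.0 + 1.0)

def ellipsoid_volume(abs_det_C, n):
    """Volume of {x : ||C (x - x0)||_2 <= 1}, the unit ball mapped by C^-1."""
    return unit_ball_volume(n) / abs_det_C

def empirical_error(contained_flags):
    """Fraction of test samples that fall outside the estimated reachset."""
    return sum(1 for inside in contained_flags if not inside) / len(contained_flags)

def total_volume(abs_dets, n):
    """Sum of reachset volumes over the time grid for one initial set."""
    return sum(ellipsoid_volume(d, n) for d in abs_dets)

print(empirical_error([True, True, False, True]))
print(total_volume([1.0, 0.5], n=2))
```

For example, C = diag(1/2, 1) has |det C| = 0.5 and describes an ellipse with semi-axes 2 and 1, whose area is 2*pi.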

All experiments were conducted on a Linux workstation with two Xeon Sil- 
ver 4110 CPUs and 32 GB RAM. As shown in Figure 2 (Left), NeuReach learns 
an accurate reachability function for each benchmark. Please note that due to 
the complicated dynamics and the neural controller, the F-16 model and the 
quadrotor are beyond the reach of current model-based tools. As shown in Fig- 
ure 1 (Right), NeuReach successfully verified the safety of the F-16 model. 
Benchmark          NeuReach              DryVR
                   Volume     Error      Volume       Error
JetEngine          17.9       0.001      38.3         0.003
VanDerPol          39.2       0.001      76.4         0.002
Quadrotor          373.9      0.019      1025146.2    0.021
F-16 (Spherical)   28153.7    0.004      62651.5      0.004
F-16               31465.9    0.025      -            -


Fig. 2: Left: Volume and error of the estimated reachtube. Results are averaged over
10 random choices of $\mathcal{X}_0$; Right: Impact of $\lambda$. Error bars are the range over 10 runs.


Comparison with DryVR. DryVR [21] computes reachsets for spherical
initial sets by learning a piece-wise exponential discrepancy (PED) function that
bounds the sensitivity of the trajectories to the initial state. This function is of
the form:

$$\beta(r, t) = r K e^{\sum_{j=1}^{i-1} \gamma_j (t_j - t_{j-1}) + \gamma_i (t - t_{i-1})}, \quad t \in [t_{i-1}, t_i],$$

where $r$ is the radius of the initial set, $[t_{i-1}, t_i]$ is the $i$-th time interval, and $K$, $\gamma$
are learned parameters. For a spherical initial set $\mathcal{X}_0 = B(c,r)$, the computed
reachset is $R(B(c,r),t) := B(\xi(\mathbb{E}[D(B(c,r))],t), \beta(r,t))$, where $B(c,r)$ is a ball
centered at $c$ with radius $r$. It is important to recall that, similar to other
reachability tools, for every new initial set $\mathcal{X}_0$, DryVR computes the PED function and
the reachset from scratch. For a fair comparison, we compute the parameters $K$
and $\gamma$ on the exact same training set as the one used in NeuReach and reuse the
resulting PED for further queries.
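
A piece-wise exponential discrepancy of this shape can be evaluated as in the sketch below. It assumes the exponent accumulates gamma_j over all completed intervals plus the partial current interval; the function name `ped` and the parameter values are hypothetical, and this is not DryVR's implementation.

```python
import math

def ped(r, t, K, gammas, times):
    """Piece-wise exponential discrepancy beta(r, t).

    times = [t_0, t_1, ..., t_m] are interval endpoints and gammas[i-1] is the
    exponent gamma_i used on [t_{i-1}, t_i]. Assumes t_0 <= t <= t_m.
    """
    exponent = 0.0
    for i in range(1, len(times)):
        if t <= times[i]:
            # Partial contribution of the interval that contains t.
            exponent += gammas[i - 1] * (t - times[i - 1])
            return r * K * math.exp(exponent)
        # Full contribution of an interval that ends before t.
        exponent += gammas[i - 1] * (times[i] - times[i - 1])
    raise ValueError("t lies beyond the last interval")

# Radius contracts on [0, 1] (gamma = -1) and expands on [1, 2] (gamma = 0.5).
for t in (0.0, 1.0, 1.5):
    print(t, ped(0.5, t, K=2.0, gammas=[-1.0, 0.5], times=[0.0, 1.0, 2.0]))
```

The resulting radius is continuous in t and scales linearly with the initial radius r, so the reachset is always a ball, whereas a learned matrix can stretch each coordinate differently.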


Accuracy and conservatism. As shown in Figure 2 (Left), the reachsets 
estimated by NeuReach are tighter and more accurate than those computed by 
DryVR. There are two reasons for this. First, DryVR uses piece-wise exponential 
functions to capture the relationship between the initial radius and the radius at 
time t, while NeuReach uses more expressive neural networks. Second, the use of 
ellipsoids allows coordinate-specific accuracy. As seen in Figure 1, the reachset 
of JetEngine is not a perfect circle even if the initial set is a circle. Ellipsoids 
can approximate the actual reachsets better. 


Running time. As expected, the training phase of NeuReach takes several 
minutes, but once a reachability function has been learned, computation of the 
reachset from a new initial set is very fast. For the quadrotor system, for example, 
this takes ~ 0.3 ms on the aforementioned workstation. We believe that this 
makes NeuReach suitable for online safety checking and motion planning. 


Impact of the hyper-parameter $\lambda$. $\lambda$ influences the error and volume of the
reachsets computed by NeuReach. Figure 2 (Right) shows the result of running
NeuReach on JetEngine with different settings of $\lambda$. As expected, larger $\lambda$ results
in smaller reachsets but hurts the accuracy. On the other hand, we do not need
to tune $\lambda$ case by case. Note that we use $\lambda = 0.03$ for all the results in Figure 2
(Left), and it works reasonably well for all our benchmarks.


6 Conclusion 


In this paper, we presented a tool for computing reachability of systems using 
machine learning. NeuReach can learn accurate reachability functions for com- 
plex nonlinear systems, including some that are beyond existing methods. From a 
learned reachability function, arbitrary reachtubes can be computed in millisec- 
onds. There are several limitations in the current implementation of NeuReach. 
First, the simulator is assumed to be deterministic—this can be too restrictive 
for autonomous systems with complex perception and vehicle models. We plan 
to extend the theory and implementation to support more general simulators. 
Secondly, the over-approximations are restricted to be represented as ellipsoids. 
Other representations will be supported in the future. 

Abstract. We present a PDR/IC3 algorithm for finding inductive invariants
with quantifier alternations. We tackle scalability issues that
arise due to the large search space of quantified invariants by combining 
a breadth-first search strategy and a new syntactic form for quantifier- 
free bodies. The breadth-first strategy prevents inductive generalization 
from getting stuck in regions of the search space that are expensive to 
search and focuses instead on lemmas that are easy to discover. The new 
syntactic form is well-suited to lemmas with quantifier alternations by 
allowing both limited conjunction and disjunction in the quantifier-free 
body, while carefully controlling the size of the search space. Combining 
the breadth-first strategy with the new syntactic form results in useful 
inductive bias by prioritizing lemmas according to: (i) well-defined syn- 
tactic metrics for simple quantifier structures and quantifier-free bodies, 
and (ii) the empirically useful heuristic of preferring lemmas that are fast 
to discover. On a benchmark suite of primarily distributed protocols and 
complex Paxos variants, we demonstrate that our algorithm can solve 
more of the most complicated examples than state-of-the-art techniques. 


Keywords: invariant inference · quantifier alternation · PDR/IC3


1 Introduction 


Invariant inference is a long-standing problem in formal methods, due to the 
desire for verified systems without the cost of manually writing invariants. For 
complex unbounded systems the required invariants often involve quantifiers, 
including quantifier alternations. For example, an invariant for a distributed 
system may need to quantify over an unbounded number of nodes, messages, etc. 
Furthermore, it may need to nest quantifiers in alternation (between $\forall$ and $\exists$) to
capture the system's correctness arguments. For example, one crucial invariant of
the Paxos consensus protocol [22] is "every decision must come from a quorum of
votes", i.e. $\forall \mathit{decision}.\ \exists \mathit{quorum}.\ \forall \mathit{node}.\ \mathit{node} \in \mathit{quorum} \Rightarrow \mathit{node}$ voted for $\mathit{decision}$.
We show that automatically inferring such invariants is possible for systems
beyond the current state of the art by addressing several scalability issues that 
arise as the complexity of systems and their invariants increases. 


Many recent successful invariant inference techniques, including ours, are 
based on PDR/IC3 [3,5]. PDR/IC3 is an algorithmic framework for finding in- 
ductive invariants incrementally, rather than attempting to find the entire induc- 
tive invariant at once. PDR/IC3 progresses by building a collection of lemmas, 
organized into frames labeled by number of steps from the initial states, until 
eventually some of these lemmas form an inductive invariant. New lemmas are 
generated by inductive generalization, where a given (often backward reachable) 
state is generalized to a formula that excludes it and is inductive relative to a 
previous frame. Inductive generalization therefore plays a key role in PDR/IC3 
implementations. Specifically, extending PDR/IC3 to a new domain of lemmas 
requires a suitable inductive generalization procedure. 

Techniques for inductive generalization, and more broadly for generating for- 
mulas for inductive invariants, are varied, including interpolation [25], quantifier 
elimination [20], model-based techniques [18], and syntax guided synthesis [6,31]. 
Almost all of these existing techniques target either quantifier-free or universally 
quantified invariants. While it is sometimes possible to manually transform a 
transition system to eliminate some of the need for quantifiers [8], doing so is 
difficult and requires some knowledge of the fully quantified invariant. 


We present a system that can infer quantified invariants with alternations 
based on quantified separation, which was introduced in [19]. Roughly, a separa- 
tion query asks whether there is a quantified formula, a separator, that evaluates 
to true on a given set of models and to false on another given set of models. While 
[19] used separation (as a black box) to implement inductive generalization and 
described the first PDR/IC3 implementation that finds invariants with quan- 
tifier alternations, it did not scale to challenging protocols such as Paxos and 
its variants. These protocols require invariants with many symbols and quanti- 
fiers, and the search space for quantified separators explodes as the number of 
symbols in the vocabulary and number of quantifiers increases. In contrast, this 
work presents a technique that can automatically find such complex invariants. 


When targeting complex invariants, there are two main challenges for induc- 
tive generalization: (i) the run time of each individual query; and (ii) overfitting, 
i.e., learning a lemma that eliminates the given state but does not advance the 
search for an inductive invariant. We tackle both problems via two strategies: 
the first integrates inductive generalization with separation in a breadth-first 
way, and the second defines a new form, k-term pDNF, for the quantifier-free 
Boolean structure of the separators. 

Integrating quantified separation with inductive generalization enables us to 
effectively use a breadth-first rather than a depth-first search strategy for the 
quantifiers of potential separators: we search in multiple parts of the search space 
simultaneously rather than exhaustively exploring one region before moving to 
the next. Beyond enabling parallelism, and thus faster wall-clock times, this 
restructuring can change which solution is found by allowing easy-to-search
regions to find a solution first. We find that these easier-to-find formulas generalize
better (i.e., avoid overfitting). 

Using k-term pDNF narrows the search space for lemmas with quantifier al- 
ternations. Universally quantified invariants can be split into universally quan- 
tified clauses by transformation into conjunctive normal form (CNF). Accord- 
ingly, most PDR/IC3 based techniques find invariants as conjunctions of possibly 
quantified clauses. However, invariants with quantifier alternations may require 
conjunction inside quantified lemmas (e.g., consider $\forall x.\ \exists y.\ p(y) \wedge r(x, y)$). Using
multiple clauses per lemma (k-clause CNF) creates a significantly larger search 
space, impeding scalability. Using disjunctive normal form (DNF) suffers from 
the same problem. We introduce k-term pDNF, a class of Boolean formulas in- 
spired by human-written invariants that allows both limited conjunction and 
disjunction while keeping the search space manageable. Many of the lemmas 
arising in our evaluation that require many clauses in CNF are only 2-term 
pDNF. We modify separation to search for lemmas of this form, leading to a re- 
duced search space compared to CNF or DNF, resulting in both faster inductive 
generalization and less overfitting. 

We evaluate our technique on a benchmark suite that includes challenging dis- 
tributed protocols. Inferring invariants with quantifier alternations has recently 
drawn significant attention, with recent works, [19,11], presenting techniques 
based on PDR/IC3 that find invariants with quantifier alternations but do not 
scale to complex protocols such as Paxos. Very recently, [14] and [12] presented 
enumeration-based and PDR/IC3-based techniques, respectively, which find the 
invariant for simple variants of Paxos, but do not scale to more complex variants. 
Our experiments show that our separation-based approach significantly advances 
the state-of-the-art, and scales to several Paxos variants which are unsolved by 
prior works. We also present an ablation study that investigates the individual 
effect of key features of our technique. 

This work makes the following contributions: 


1. An algorithm for inductive generalization in PDR/IC3 (Section 3) based on 
quantified separation that explores the search space in a parallel, breadth- 
first way and thus focuses on lemmas that are easy to discover without 
requiring a priori knowledge of the search space. 

2. A syntactic form of lemmas (k-pDNF, Section 4) that is well-suited for in- 
variants with quantifier alternations. 

3. A combined system (Section 5) able to infer the invariants of challenging 
protocols with quantifier alternations, including complex Paxos variants. 

4. A comprehensive evaluation (Section 6) on a large benchmark suite including 
complex Paxos variants, comparisons with a variety of state-of-the-art tools, 
and an ablation study exploring the effects of key features of our technique. 


2 Background 


We review first-order logic, quantified separation, the invariant inference prob- 
lem, and PDR/IC3. 

First-Order Logic. We consider formulas in many-sorted first-order logic with
uninterpreted functions and equality. A signature consists of a finite set of sorts
and sorted constant, relation, and function symbols. A first-order structure over
a given signature consists of a universe set of sorted elements along with
interpretations for each symbol. A structure is finite when its universe is finite.
We use the standard definitions for term, atomic formula, literal, and quantifier-free
formula. Quantified formulas may contain universal ($\forall$) and existential ($\exists$)
quantifiers with sorted variables (e.g. $\forall v{:}s_1.\ p$). A formula is in prenex normal form if
it consists of a (possibly empty) quantification prefix followed by a quantifier-free
matrix. Any formula can be mechanically transformed into an equivalent prenex
formula. A structure $M$ satisfies a formula $p$, written $M \models p$, if the formula
is true when the symbols in $p$ are interpreted according to $M$ under the usual
semantics. If such an $M$ exists, then $p$ is satisfiable and $M$ is a model of $p$.
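
On a finite structure, satisfaction of a formula with quantifier alternation can be checked by nesting `all` and `any`. The sketch below evaluates a formula of the shape "forall d. exists q. forall n. member(n, q) -> voted(n, d)"; the toy structure and all names are hypothetical and only illustrate the semantics.

```python
def every_decision_has_quorum(decisions, quorums, nodes, member, voted):
    """Check  forall d. exists q. forall n. member(n, q) -> voted(n, d)."""
    return all(
        any(all((not member(n, q)) or voted(n, d) for n in nodes)
            for q in quorums)
        for d in decisions)

# A finite structure: three nodes, two quorums, and a vote relation.
nodes = ["n1", "n2", "n3"]
quorums = {"q12": {"n1", "n2"}, "q23": {"n2", "n3"}}
votes = {("n1", "v"), ("n2", "v")}
member = lambda n, q: n in quorums[q]
voted = lambda n, d: (n, d) in votes

print(every_decision_has_quorum(["v"], quorums, nodes, member, voted))
print(every_decision_has_quorum(["v", "w"], quorums, nodes, member, voted))
```

The first call succeeds because quorum q12 voted for v; the second fails because no quorum voted for w. The order of `all` and `any` mirrors the order of the quantifier prefix.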


Quantified Separation. To generate candidate lemmas, we use quantified
separation [19]. Given a set of structure constraints and a predetermined space
of formulas, separation produces a separator formula p from the space that
satisfies the constraints, or reports UNSEP if no such p exists. The
constraints are either positive (a structure M where M ⊨ p), negative (a
structure M where M ⊭ p), or implication (a pair of structures M, M′ where
M ⊨ p ⇒ M′ ⊨ p). Separation producing prenex formulas under some assumptions
(satisfied by practical examples) is NP-complete [19], and can be solved by
translation to SAT.
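The three kinds of constraints can be made concrete with a small Python sketch. It is not the SAT-based procedure of [19]: it simply enumerates an explicit, hypothetical list of candidates, and it models a structure as a frozenset and a formula as a Python predicate. All names here are assumptions chosen for illustration.

```python
def satisfies_constraint(candidate, constraint):
    """candidate(M) plays the role of M |= p for a structure M."""
    kind = constraint[0]
    if kind == "positive":
        return candidate(constraint[1])
    if kind == "negative":
        return not candidate(constraint[1])
    if kind == "implication":
        pre, post = constraint[1], constraint[2]
        return (not candidate(pre)) or candidate(post)
    raise ValueError(f"unknown constraint kind: {kind}")


def separate(space, constraints):
    """Return the name of the first candidate meeting all constraints, else "UNSEP"."""
    for name, candidate in space:
        if all(satisfies_constraint(candidate, c) for c in constraints):
            return name
    return "UNSEP"


# Hypothetical space: a structure is the set of clients currently holding a lock.
space = [
    ("no holder", lambda m: len(m) == 0),
    ("at most one holder", lambda m: len(m) <= 1),
    ("true", lambda m: True),
]
constraints = [
    ("positive", frozenset({1})),
    ("negative", frozenset({1, 2})),
    ("implication", frozenset(), frozenset({2})),
]
result = separate(space, constraints)
# Adding a negative constraint on a structure that is also positive is unsatisfiable.
unsep = separate(space, constraints + [("negative", frozenset({1}))])
```

Brute-force enumeration is only viable for tiny spaces; the point of the sketch is the meaning of each constraint kind, not the search strategy.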


Invariant Inference. The invariant inference problem is to compute an inductive
invariant for a given transition system, which shows that only safe states are
reachable from the initial states. We consider a transition system to be a set
of states as structures over some signature satisfying an axiom Ax, some initial
states satisfying Init, a transition formula Tr which can contain primed symbols
(x′) representing the post-state, and safe states satisfying Safe. We define bad
states as ¬Safe. We define single-state implication, written A ⇒ B, as
UNSAT(A ∧ Ax ∧ ¬B) and two-state implication across transitions, written
A ⇒ wp(B),* as UNSAT(A ∧ Ax ∧ Tr ∧ Ax′ ∧ ¬B′). An inductive invariant is a
formula I satisfying:

    Init ⇒ I  (1)        I ⇒ wp(I)  (2)        I ⇒ Safe  (3)


Together, (1) and (2) mean that I is satisfied by all reachable states, and (3) 
ensures the system is safe. We only consider invariant inference for safe systems. 
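Conditions (1) to (3) can be checked directly on a small explicit-state system. The sketch below is a toy under stated assumptions: states are integers, formulas are Python predicates, the axiom Ax is omitted, and the checks enumerate states instead of calling an SMT solver. The counter system and all names are hypothetical.

```python
def is_inductive_invariant(states, init, trans, safe, inv):
    """Check conditions (1)-(3) by enumeration over a finite state space."""
    initiation = all(inv(s) for s in states if init(s))                  # (1) Init => I
    consecution = all(inv(t) for s in states if inv(s)
                      for t in states if trans(s, t))                    # (2) I => wp(I)
    safety = all(safe(s) for s in states if inv(s))                      # (3) I => Safe
    return initiation and consecution and safety


# Hypothetical system: a counter modulo 6 that starts at 0 and steps by 2.
states = range(6)
init = lambda s: s == 0
trans = lambda s, t: t == (s + 2) % 6
safe = lambda s: s != 5

# Safe alone is not inductive: the unreachable state 3 is safe but steps to 5.
safe_is_inductive = is_inductive_invariant(states, init, trans, safe, safe)
# Strengthening to "the counter is even" is inductive and implies safety.
even_is_inductive = is_inductive_invariant(states, init, trans, safe, lambda s: s % 2 == 0)
```

The example shows why a safety property usually needs a strengthening lemma: Safe holds on all reachable states but is not preserved from unreachable ones.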


PDR/IC3. PDR/IC3 is an invariant inference algorithm first developed for finite
state model checking [3] and later extended to various classes of infinite-state
systems. We describe PDR/IC3 as in [17]. PDR/IC3 maintains frames F_i as
conjunctions of formulas (lemmas) representing overapproximations of the states
reachable in at most i transitions from Init. Finite frames (i = 0, ..., n) and
the frame at infinity (i = ∞) satisfy:

    F_i ⇒ wp(F_{i+1})  (7)        F_∞ ⇒ wp(F_∞)  (8)


* Our use of wp is inspired by predicate transformers, but we define it via satisfiability.


Conditions (4), (5), and (6) mean Init ⇒ F_i for all i, and we ensure this by
restricting frames to subsets of the prior frame, when taken as sets of lemmas.
Conditions (7) and (8) say each frame is relatively inductive to the prior frame,
except F_∞ which is relatively inductive to itself and thus inductive for the
system. To initialize, the algorithm adds the (conjuncts of) Init and Safe as
lemmas to F_0. The algorithm then proceeds by adding lemmas to frames using
either pushing or inductive generalization while respecting this meta-invariant,
gradually tightening the bounds on reachability until F_∞ ⇒ Safe. We can push a
lemma p ∈ F_i to F_{i+1}, provided F_i ⇒ wp(p). When a formula is pushed, the
stronger F_{i+1} may permit us to push one or more other formulas, possibly
recursively, and so we always push until a fixpoint is reached. A mutually
relatively inductive set of lemmas does not have a finite fixpoint, and we
detect these sets (by checking for F_i = F_{i+1}) and move them to F_∞.
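Pushing to a fixpoint can be sketched on an explicit-state system as follows. This is a simplification under stated assumptions: frames are sets of lemma names, the check F_i ⇒ wp(p) is done by enumerating transitions rather than by an SMT query, and the detection of mutually inductive sets (the move to F_∞) is left out. The function and lemma names are hypothetical.

```python
def push_to_fixpoint(frames, lemmas, states, trans):
    """frames[i] is a set of lemma names; lemmas maps each name to a state predicate."""
    def in_frame(i, s):
        return all(lemmas[name](s) for name in frames[i])

    changed = True
    while changed:
        changed = False
        for i in range(len(frames) - 1):
            for name in sorted(frames[i] - frames[i + 1]):
                # F_i => wp(p): every successor of a state in F_i satisfies p.
                if all(lemmas[name](t) for s in states if in_frame(i, s)
                       for t in states if trans(s, t)):
                    frames[i + 1].add(name)
                    changed = True
    return frames


# Hypothetical system: a counter modulo 6 stepping by 2.
states = range(6)
trans = lambda s, t: t == (s + 2) % 6
lemmas = {"not5": lambda s: s != 5, "even": lambda s: s % 2 == 0}

# With the supporting lemma "even", both lemmas push through every frame.
frames_a = push_to_fixpoint([{"not5", "even"}, set(), set()], lemmas, states, trans)
# Without it, "not5" cannot be pushed: state 3 is in F_0 and steps to 5.
frames_b = push_to_fixpoint([{"not5"}, set()], lemmas, states, trans)
```

Because lemmas are only ever copied from F_i into F_{i+1}, each frame stays a subset of the prior frame, matching the restriction described above.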

If the algorithm cannot push a lemma pa beyond frame i, there is a model of
¬(F_i ⇒ wp(pa)), which is a transition s → t where s ∈ F_i and t ⊭ pa. We call
the pre-state s a pushing preventer of pa. To generate new lemmas, we block the
pushing preventer s in F_i by first recursively blocking all predecessors of s
that are still in F_{i-1}, and then using an inductive generalization (IG) query
to learn a new lemma that eliminates s. An IG query finds a formula p satisfying:


    s ⊭ p  (9)        Init ⇒ p  (10)        F_{i-1} ∧ p ⇒ wp(p)  (11)


If we can learn such a lemma, it can be added to F_i and all previous frames,
and removes at least the state s stopping pa from being pushed. Classic PDR/IC3 
always chooses to block the pushing preventer of a safety property (lemma from 
Safe) or a predecessor thereof, but other strategies have been considered [17]. 
The technique used to solve IG queries controls what kind of invariants we are 
able to discover. In this work we use separation to solve for p, which lets us infer 
invariants with quantifier alternations. 


3 Breadth-First Inductive Generalization with Separation 


Inductive generalization is the core of PDR/IC3, and improving it comes in two 
flavors: making individual queries faster, and generating better lemmas that are 
more general. We address both of these concerns by restructuring the search to 
be breadth-first rather than depth-first. We first discuss naively solving an IG 
query with separation (as in [19]), then present an algorithm that restructures 
the search in a breadth-first manner. 


3.1 Naive Inductive Generalization with Separation 


An IG query is solved in [19] with separation by a simple refinement loop, which
performs a series of separation queries with an incrementally growing set of
structure constraints. Starting with a negative constraint s for the state to
block, we ask for a separator p and check if eqs. (10) and (11) hold for p using
a standard SMT solver. If both hold, p is a solution to the IG query. Otherwise,
the SMT solver produces a model which becomes either a positive constraint
(corresponding to an initial state p violates) or an implication constraint (a
transition edge that shows p is not relatively inductive to F_{i-1}), respectively.
At a high level, the SAT-based algorithm for separation from [19] uses 
Boolean variables to encode the kind (∀/∃) and sort of each quantifier, and
additional variables for the presence of each syntactically valid literal in each clause
in the matrix, which is in CNF. It then translates each structure constraint 
into a Boolean formula over these variables such that satisfying assignments en- 
code formulas with the correct truth value for each structure. The details of the 
translation to SAT are not relevant here, except a few key points: (i) separation 
considers each potential quantifier prefix essentially independently, (ii) complex 
IG queries can result in hundreds or thousands of constraints, and (iii) prefixes, 
as partitions of the space of possible separators, vary greatly in how quickly they 
can be explored. Further, with the black box approach where the prefixes are 
considered internally by the separation algorithm, even if the separation algo- 
rithm uses internal parallelism as suggested in [19], there is still a serialization 
step when a new constraint is required. As a consequence of (ii) and (iii), a 
significant failure mode of this naive approach is that the search becomes stuck 
generating more and more constraints for difficult parts of the search space that 
ultimately do not contain an easy-to-discover solution to the IG query. 
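The refinement loop of the naive approach can be sketched on an explicit-state toy. The sketch makes several assumptions: the separator space is a short explicit list of named predicates instead of a SAT encoding, and the checks for eqs. (10) and (11) enumerate initial states and transitions rather than querying an SMT solver. The names `naive_ig` and `consistent` are hypothetical.

```python
def consistent(pred, constraints):
    """True if the candidate predicate meets every structure constraint."""
    for c in constraints:
        if c[0] == "negative" and pred(c[1]):
            return False
        if c[0] == "positive" and not pred(c[1]):
            return False
        if c[0] == "implication" and pred(c[1]) and not pred(c[2]):
            return False
    return True


def naive_ig(space, s, init_states, transitions, prev_frame):
    constraints = [("negative", s)]                      # eq. (9): block state s
    while True:
        found = next(((n, p) for n, p in space if consistent(p, constraints)), None)
        if found is None:
            return "UNSEP", constraints
        name, p = found
        # eq. (10): an initial state violating p becomes a positive constraint.
        bad_init = next((m for m in init_states if not p(m)), None)
        if bad_init is not None:
            constraints.append(("positive", bad_init))
            continue
        # eq. (11): a transition leaving p becomes an implication constraint.
        edge = next(((a, b) for a, b in transitions
                     if prev_frame(a) and p(a) and not p(b)), None)
        if edge is not None:
            constraints.append(("implication", edge[0], edge[1]))
            continue
        return name, constraints


# Hypothetical system: a counter modulo 6 stepping by 2; the state to block is 5.
transitions = [(x, (x + 2) % 6) for x in range(6)]
space = [("ne5", lambda x: x != 5), ("even", lambda x: x % 2 == 0)]
solution, used = naive_ig(space, 5, [0], transitions, lambda x: True)
```

Each added constraint rules out the candidate that produced it, so the loop terminates on a finite space. In this run the first candidate is refuted by the edge from 3 to 5, after which the second candidate is accepted.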


3.2 Prefix Search at the Inductive Generalization Level 


To fix the problems with the naive approach, we propose lifting the choice of 
prefix to the IG level, partitioning a single large separation query into a query 
for each prefix. Each sub-query can be explored in parallel, and each can pro- 
ceed independently by querying for new constraints (using eqs. (10) and (11) as 
before) without serializing by waiting for other prefixes. We call this a breadth- 
first search, because the algorithm can spend approximately equal time on many 
parts of the search space, instead of a depth-first search which exhausts all pos- 
sibilities in one region before moving on to the next. When regions have greatly 
varying times to search, the breadth-first approach prevents expensive regions 
from blocking the search in cheaper regions. This improvement relies on chang- 
ing the division between separation and inductive generalization: without the 
knowledge of the formulas (eqs. (10) and (11)) that generate constraints, the 
separation algorithm cannot generate new constraints on its own. 

A complicating factor is that in addition to prefixes varying in difficulty, 
sometimes there are entire classes of prefixes that are difficult. For example, 
many IG queries have desirable universal-only solutions, but spend a long time 
searching for separators with alternations, as there are far more distinct pre- 
fixes with alternations than those with only universals. To address this problem, 
we define possibly overlapping sets of prefixes, called prefix categories, and
ensure the algorithm spends approximately equal time searching for solutions in
each category (e.g., universally quantified invariants, invariants with at most
one alternation and at most one repeated sort). Within each category, we order
prefixes to further bias towards likely solutions: first by smallest quantifier
depth, then fewest alternations, then those that start with a universal, and
finally by smallest number of existentials.


def IG(s: state, i: frame):
    ∀P. C(P) = {Negative(s)};
    for i = 1...N in parallel:
        while true:
            P = next-prefix();
            while true:
                p = separate C(P);
                if p is UNSEP:
                    break
                elif any c ∈ R_C(P) and p ⊭ c:
                    add c to C(P)
                elif (c := SMT check eqs. (10) and (11)) ≠ UNSAT:
                    add c to C(P)
                else:
                    return p as solution


Fig. 1. Pseudocode for our proposed inductive generalization algorithm.
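The prefix ordering within a category can be expressed as a sort key. In this minimal sketch a prefix is a string over 'A' (universal) and 'E' (existential), outermost quantifier first; sorts are ignored, which is an assumption of the example, and `prefix_key` is a hypothetical name.

```python
def prefix_key(prefix):
    """Order by depth, then alternations, then universal-first, then existentials."""
    depth = len(prefix)
    alternations = sum(1 for a, b in zip(prefix, prefix[1:]) if a != b)
    starts_existential = 0 if prefix.startswith("A") else 1
    existentials = prefix.count("E")
    return (depth, alternations, starts_existential, existentials)


ordered = sorted(["AE", "EA", "AA", "A", "EE", "E", "AEA"], key=prefix_key)
```

Python compares the key tuples lexicographically, so each criterion only breaks ties left by the previous one.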


3.3 Algorithm for Inductive Generalization 


We present our algorithm for IG using separation in Figure 1. Our algorithm has 
a fixed number N of worker threads which take prefixes from a queue subject to 
prefix restrictions, and perform a separation query with that prefix. Each worker 
thread calls next-prefix() to obtain the next prefix to consider, according to the 
order discussed in the previous section. To solve a prefix P, a worker performs 
a refinement loop as in the naive algorithm, building a set of constraints C(P) 
until a solution to the IG query is discovered or separation reports UNSEP. 

While we take steps to make SMT queries for new constraints as fast as pos- 
sible (Section 5.4), these queries are still expensive and we thus want to re-use 
constraints between prefixes where it is beneficial. Re-using every constraint dis- 
covered so far is not a good strategy as the cost of checking upwards of hundreds 
of constraints for every candidate separator is not justified by how frequently 
they actually constrain the search. Instead, we track a set of related constraints 
for a prefix P, R_C(P). We define related constraints in terms of immediate
sub-prefixes of P, written S(P), which are prefixes obtained by dropping exactly
one quantifier from P, i.e. the quantifiers of P′ ∈ S(P) are a subsequence of
those in P with one missing. We then define R_C(P) = ⋃_{P′ ∈ S(P)} C(P′), i.e. the
related constraints of P are all those used by immediate sub-prefixes. While 
S(P) considers only immediate sub-prefixes, constraints may propagate from 
non-immediate sub-prefixes as the algorithm progresses. 
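The definitions of S(P) and R_C(P) translate directly into code. In this sketch a prefix is a tuple of (quantifier, sort) pairs and constraints are opaque hashable labels; both representations, and the function names, are assumptions for illustration.

```python
def immediate_subprefixes(prefix):
    """S(P): all prefixes obtained by dropping exactly one quantifier from P."""
    return {prefix[:i] + prefix[i + 1:] for i in range(len(prefix))}


def related_constraints(prefix, constraints_by_prefix):
    """R_C(P): the union of C(P') over the immediate sub-prefixes P' of P."""
    related = set()
    for sub in immediate_subprefixes(prefix):
        related |= constraints_by_prefix.get(sub, set())
    return related


P = (("forall", "round"), ("exists", "node"), ("forall", "value"))
C = {
    (("forall", "round"), ("forall", "value")): {"c1", "c2"},
    (("forall", "round"), ("exists", "node")): {"c2", "c3"},
    (("forall", "round"),): {"c9"},   # not an immediate sub-prefix of P
}
subs = immediate_subprefixes(P)
related = related_constraints(P, C)
```

Constraints of the shorter prefix in `C` are not picked up directly; they can still reach P indirectly if an immediate sub-prefix adopted them earlier, which is the propagation described above.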

Constraints from sub-prefixes are used because the possible separators for 
those queries are also possible separators for the larger prefix. Thus the set of 
constraints from sub-prefixes will definitely eliminate some potential separators, 
and in the usual case where the sub-prefixes have converged to UNSEP, will rule
out an entire section of the search space. We also opportunistically make use of
known constraints for the same prefix generated in prior IG queries, as long as 
those constraints still satisfy the current frame. 

Overall, the algorithm in Figure 1 uses parallelism across prefixes to gener- 
ate independent separation queries in a breadth-first way, while carefully sharing 
only useful constraints. From the perspective of the global search for an induc- 
tive invariant the algorithm introduces two forms of inductive bias: (i) explicit 
bias arising from controlling the order and form of prefixes (Section 3.2), and 
(ii) implicit bias towards formulas which are easy to discover. 


4 k-Term Pseudo-DNF 


We now consider the search space for quantifier-free matrices, and introduce a 
syntactic form that shrinks the search space while still allowing common invari- 
ants with quantifier alternations to be expressed. 

Conjunctive and disjunctive normal forms (CNF and DNF) are formulas that 
consist of a conjunction of clauses (CNF) or a disjunction of cubes (DNF), where 
clauses and cubes are disjunctions and conjunctions of literals, respectively. For
example, (a ∨ b ∨ ¬c) ∧ (b ∨ c) is in CNF and (a ∧ ¬c) ∨ (¬a ∧ b) is in DNF. We
further define k-clause CNF and k-term DNF as formulas with at most k clauses 
and cubes, respectively. 

In [19] separation is performed by finding a matrix in k-clause CNF, biasing 
the search by minimizing the sum of the number of quantifiers and k. We find 
that both CNF and DNF are not good fits for the formulas in human-written 
invariants. For example, consider the following formula from Paxos: 


∀r1, r2, v1, v2, q. ∃n. r1 < r2 ∧ proposal(r2, v2) ∧ v1 ≠ v2
    → member(n, q) ∧ left-round(n, r1) ∧ ¬vote(n, r1, v1)


To write this in CNF, we need to distribute the antecedent over the conjunction, 
obtaining the 3-clause formula: 


(r1 < r2 ∧ proposal(r2, v2) ∧ v1 ≠ v2 → member(n, q)) ∧
(r1 < r2 ∧ proposal(r2, v2) ∧ v1 ≠ v2 → left-round(n, r1)) ∧
(r1 < r2 ∧ proposal(r2, v2) ∧ v1 ≠ v2 → ¬vote(n, r1, v1))


When written without →, this matrix has the form ¬a ∨ ¬b ∨ c ∨ (d ∧ e ∧ ¬f),
which is already in DNF. Under the k-term DNF, however, the formula requires a 
single-literal cube for each antecedent literal, i.e. k = 4. Because of the quantifier 
alternation, we cannot split this formula into cubes or clauses, and so a search 
over either CNF or DNF must consider a significantly larger search space. To 
solve these issues, we define a variant of DNF, k-term pseudo-DNF (k-pDNF), 
where one cube is negated, yielding as many individual literals as needed: 


Definition 1 (k-term pseudo-DNF). A quantifier-free formula p is in k-term
pseudo-DNF for k ≥ 1 if p = ¬c_1 ∨ c_2 ∨ ... ∨ c_k, where c_1, ..., c_k are
cubes. Equivalently, p is in k-term pDNF if there exists n ≥ 0 such that p =
ℓ_1 ∨ ... ∨ ℓ_n ∨ c_2 ∨ ... ∨ c_k, where ℓ_1, ..., ℓ_n are literals and c_2, ..., c_k are cubes.


Note that 1-term pDNF is equivalent to 1-clause CNF, i.e. a single clause. 2-term
pDNF corresponds to formulas of the form (cube) → (cube). Such formulas
are sufficient for all but a handful of the lemmas required for invariants in our 
benchmark suite. An exception is the following, which has one free literal and 
two cubes (so it is 3-term pDNF): 


∀v1. ∃n1, n2, n3, v2, v3.
    (d(v1) → am(n1) ∧ u(n1, v1)) ∨
    (am(n2) ∧ am(n3) ∧ u(n2, v2) ∧ u(n3, v3) ∧ v2 ≠ v3)


For a fixed k, k-clause CNF, k-term DNF, and k-term pDNF all have the 
same-size search space, as the SAT query inside the separation algorithm will 
have one indicator variable for each possible literal in each clause or cube. The 
advantage of pDNF is that it can express more invariant lemmas with a small 
k, reducing the size of the search space while still being expressive. We can also 
see pDNF as a compromise between CNF and DNF, and we find that pDNF is 
a better fit to the matrices of invariants with quantifier alternation. 
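The Paxos matrix shape ¬a ∨ ¬b ∨ c ∨ (d ∧ e ∧ ¬f) from this section gives a small worked comparison of the three forms. The sketch below assumes a list-of-literals encoding of cubes (the first cube is the negated one) and checks by truth table that the 2-term pDNF agrees with the (cube) → (cube) reading; the function names and encoding are illustrative only.

```python
from itertools import product


def eval_cube(cube, env):
    """A cube is a list of (atom, polarity) literals, read as a conjunction."""
    return all(env[atom] == polarity for atom, polarity in cube)


def eval_pdnf(cubes, env):
    """cubes[0] is the negated cube c_1; the remaining cubes are ordinary disjuncts."""
    return (not eval_cube(cubes[0], env)) or any(eval_cube(c, env) for c in cubes[1:])


# not(a and b and not c) or (d and e and not f)
c1 = [("a", True), ("b", True), ("c", False)]
c2 = [("d", True), ("e", True), ("f", False)]
pdnf = [c1, c2]


def implication_form(env):
    antecedent = env["a"] and env["b"] and not env["c"]
    consequent = env["d"] and env["e"] and not env["f"]
    return (not antecedent) or consequent


atoms = "abcdef"
equivalent = all(
    eval_pdnf(pdnf, dict(zip(atoms, values))) == implication_form(dict(zip(atoms, values)))
    for values in product([False, True], repeat=len(atoms))
)

k_pdnf = len(pdnf)      # the negated cube plus one ordinary cube
k_dnf = len(c1) + 1     # one single-literal cube per literal of c_1, plus c_2
k_cnf = len(c2)         # distributing the antecedent yields one clause per literal of c_2
```

For this shape the required k is 2 for pDNF, against 4 for DNF and 3 for CNF, which matches the counts given in the text.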


5 An Algorithm for Invariant Inference 


We now take a step back to consider the high-level PDR/IC3 structure of our 
algorithm. We have described how our algorithm performs inductive generaliza- 
tion (Sections 3 and 4), which is the central ingredient. We next discuss blocking 
states that are not backward reachable from a bad state as a heuristic for finding 
additional useful lemmas. We then discuss how we can search for formulas in the 
EPR logic fragment and techniques to increase the robustness of SMT solvers. 
Finally, we give a complete description of our proposed algorithm. 


5.1 May-proof-obligations 


In classic PDR/IC3, the choice of pushing preventer to block is always that of 
a safety property. [17] proposed a heuristic that in our terminology is to block
the pushing preventer of other existing lemmas, under the heuristic assumption 
that current lemmas in lower frames are part of the final invariant but lack a 
supporting lemma to make them inductive. The classic blocked states are known 
as must-proof-obligations, as they are states that must be eliminated somehow 
to prove the safety property. In contrast, these heuristic states are may-proof- 
obligations, as they may or may not be necessary to block. Our algorithm selects 
these lemmas at random, biased towards lemmas with smaller matrices. 

To block a state, we first recursively block its predecessors in the prior frame, 
if they exist. For may-proof-obligations,⁵ this recursion can potentially reach
all the way to an initial state in F_0, and thus proves that the entire chain of
states is reachable, i.e., the states cannot be blocked. This fact shows that the
original lemma is not part of any final invariant and cannot be pushed past its
current frame; it also provides a positive structure constraint useful for
future IG queries.


5 For unsafe transition systems, this can also occur for must-proof-obligations.


5.2 Multi-block Generalization 


After an IG query blocking state s is successful, the resulting lemma p may cause 
the original lemma that created s to be pushed to the next frame. If not, there 
will be a new pushing preventer s’. If s’ is in the same frame, we can ask whether 
there is a single IG solution formula p1 which blocks both s and s’. If we can
find such a p1, it is more likely to generalize past s and s’, and we should prefer
p1. This is straightforward to do with separation: we incrementally add another
negative constraint to the existing separation queries. To implement multi-block 
generalization, we continue an IG query if the new pushing preventer is suitable 
(i.e. exists and is in the same frame), accumulating as many negative constraints 
as we can until we do not have a suitable state or we have spent as much time 
as the original query. This timeout guarantees we do not spend more than half 
of our time on generalization, and protects us in the case that the new set of 
states cannot be blocked together with a simple formula. 


5.3 Enforcing EPR 


Effectively Propositional Reasoning (EPR, [28]) is a fragment of many-sorted 
first-order logic in which satisfiability is decidable and satisfiable formulas always 
have a finite model. The essence of EPR is to limit function symbols, both 
in the signature and from the Skolemization of existentials, to ensure only a 
finite number of ground terms can be formed. EPR ensures this property by 
requiring that there be no cycles in the directed graph with an edge from each 
domain sort to the codomain sort for every (signature and Skolem) function 
symbol. For example, (∀x:S. φ1) ∨ (∃y:S. φ2) is in EPR, but ∀x:S. ∃y:S. φ3 is not
in EPR as the Skolem function for y introduces an edge from sort S to itself.
The acyclicity requirement means that EPR is not closed under conjunction, 
and so is best thought of as a property of a whole SMT query rather than of 
individual lemmas. Despite these restrictions, EPR can be used to verify complex 
distributed protocols [28]. 
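The acyclicity condition can be sketched as a graph check. The code below assumes a prenex prefix given as (quantifier, sort) pairs, derives one Skolem edge from every enclosing universal sort to each existential sort, and runs a depth-first search for cycles. The function names and the example sorts are hypothetical, and signature function symbols are passed in as extra edges.

```python
def skolem_edges(prefix):
    """Each existential gets a Skolem function of the universals that enclose it."""
    edges, universals = set(), []
    for quant, sort in prefix:
        if quant == "forall":
            universals.append(sort)
        else:
            edges |= {(u, sort) for u in universals}
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed sort graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(n) for n in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in list(graph))


# forall x:S. exists y:S introduces the self-edge S -> S, so it is outside EPR.
self_loop = has_cycle(skolem_edges([("forall", "S"), ("exists", "S")]))
# forall r:round. exists n:node is fine on its own ...
acyclic = has_cycle(skolem_edges([("forall", "round"), ("exists", "node")]))
# ... but not together with a signature function from node to round.
with_function = has_cycle(
    skolem_edges([("forall", "round"), ("exists", "node")]) | {("node", "round")}
)
```

The last case illustrates why EPR is a property of a whole query: two individually acceptable ingredients can form a cycle once combined.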

For invariant inference with PDR/IC3, the most straightforward way to en- 
force acyclicity is to decide a priori which edges are allowed, and to not infer 
lemmas with disallowed Skolem edges. In practice, enforcing EPR means simply 
skipping prefixes during IG queries that would create disallowed edges. Without 
this fixed set of allowed edges, adding a lemma to a frame may prevent a neces- 
sary lemma from being added to the frame in a later iteration, as PDR/IC3 lacks 
a way to remove lemmas from frames. Requiring the set of allowed edges as input 
is a limitation of our technique and other state-of-the-art approaches (e.g. [14]). 
We hope that future work expands the scope of decidable logic fragments, so 
that systems require less effort to model in such a fragment. It is also possible
that our algorithm could be wrapped in an outer search over the possible acyclic
sets of edges. 

Because separation produces prenex formulas, some EPR formulas would be 
disallowed without additional effort (e.g. a prenex form of (∀x:S. φ1) ∨ (∃y:S. φ2)
is ∀x:S. ∃y:S. (φ1) ∨ (φ2)). In our implementation, we added an option where
separation produces prenex formulas that may not be in EPR directly, but where 
the scope of the quantifiers can be pushed down into the pDNF disjunction to 
obtain an EPR formula. Extra SAT variables are introduced that encode whether 
a particular quantified variable appears in a given disjunct, and we add the 
constraint that the quantifiers are nested consistently and in such a way as to 
be free of disallowed edges. Because this makes separation queries more difficult, 
we only enable this mode for the single example that requires non-prenex EPR 
formulas. 


5.4 SMT Robustness 


Even with EPR restrictions, some SMT queries we generate are difficult for the 
SMT solvers we use (Z3 [4] and CVC5⁶), sometimes taking minutes, hours, or
longer. This wide variation of solving times is significant because separation, and 
thus IG queries, cannot make progress without a new structure constraint. We 
adopt several strategies to increase robustness: periodic restarts, running multi- 
ple instances of both solvers in parallel, and incremental queries. Our incremental 
queries send the formulas to the SMT solver one at a time, asserting a subset 
of the input. An UNSAT result from a subset can be returned immediately, 
and a SAT result can be returned if there is no un-asserted formula violated by 
the model. Otherwise, one of the violating formulas is asserted, and the process 
repeats. This process usually avoids asserting all the discovered lemmas from a 
frame, which significantly speeds up many of the most difficult queries, especially 
those with dozens of lemmas in a frame or those not in EPR. 
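The incremental query strategy can be sketched independently of any particular solver. In the code below a brute-force propositional search stands in for the SMT solver, and formulas are Python predicates over a model; both are assumptions of this example, as are the names `brute_force_solve` and `incremental_check`.

```python
from itertools import product


def brute_force_solve(formulas, variables):
    """Toy stand-in for an SMT solver: return a model (dict) or None if UNSAT."""
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(f(model) for f in formulas):
            return model
    return None


def incremental_check(formulas, variables, solve=brute_force_solve):
    """Assert formulas lazily; return (model or None, number of formulas asserted)."""
    asserted, pending = [], list(formulas)
    while True:
        model = solve(asserted, variables)
        if model is None:
            return None, len(asserted)        # UNSAT on a subset is UNSAT overall
        violated = [f for f in pending if not f(model)]
        if not violated:
            return model, len(asserted)       # the model satisfies every formula
        asserted.append(violated[0])          # assert one violated formula and retry
        pending.remove(violated[0])


variables = ["x", "y", "z"]
sat_formulas = [lambda m: not m["x"], lambda m: m["y"], lambda m: not m["z"]]
sat_model, sat_asserted = incremental_check(sat_formulas, variables)

unsat_formulas = [
    lambda m: m["x"],
    lambda m: not m["x"],
    lambda m: m["y"] or m["z"],
    lambda m: m["y"] or not m["z"],
]
unsat_model, unsat_asserted = incremental_check(unsat_formulas, variables)
```

In both runs the answer is reached after asserting only part of the input (one of three formulas, then two of four), which is the effect the incremental strategy aims for on frames with many lemmas.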


5.5 Complete Algorithm 


Figure 2 presents the pseudocode for our algorithm, which consists of two parallel 
tasks (learning and heuristic), each using half of the available parallelism to 
discharge IG queries, and pushing to fixpoint after adding any lemmas. In this 
listing, the to-block(ℓ) function computes the state and frame to perform an IG
query in order to push ℓ (i.e. the pushing preventer of ℓ or a possibly multi-step
predecessor thereof). The heuristic task additionally may find reachable states, 
and thus mark lemmas as bad. We cancel an IG query when it is solved by a 
lemma learned or pushed by another task. If the algorithm terminates, then the 
conjunction of F_∞ is inductive according to the underlying SMT solver.

Our algorithm is parameterized by the logic used for inductive generalization,
and thus the form of the invariant. We support universal, EPR, and full
first-order logic (FOL) modes. Universal mode restricts the matrices to clauses,
and considers predecessors of superstructures when computing to-block() (as in
[18]). EPR mode also takes as input the set of allowed edges. In FOL mode, there
are no restrictions on the prefix.


6 Successor to CVC4 [1].


def P-FOL-IC3():
    F_0 = init ∪ safety;
    push();
    start LEARNING(), HEURISTIC();
    wait for invariant;

def LEARNING():
    while true:
        s, i = to-block(safety);
        MULTIBLOCK(safety, s, i);

def HEURISTIC():
    while true:
        ℓ = random lemma before safety;
        s, i = to-block(ℓ);
        if i = 0:
            mark s reachable;
            mark bad lemmas;
        else:
            MULTIBLOCK(ℓ, s, i);

def MULTIBLOCK(ℓ: lemma, s: state, i):
    S = {s};
    while not timed out:
        p = IG(S, i);
        speculatively add p to frame i;
        s′, i′ = to-block(ℓ);
        remove p from frame i;
        if i = i′:
            add s′ to S;
        else:
            break
    add p to frame i;
    push();


Fig. 2. Pseudocode for our proposed inference algorithm, P-FOL-IC3.


6 Evaluation 


We evaluate our algorithm and compare with prior approaches on a benchmark 
of invariant inference problems. We discuss the benchmark, our experimental 
setup, and the results. 


6.1 Invariant Inference Benchmark 


Our benchmark is composed of invariant inference problems from prior work on 
distributed protocols [29,28,7,27,30,2,9], written in or translated to the mypyvy 
tool’s input language [26]. Our benchmark contains a total of 30 problems (Ta- 
ble 1), ranging from simple (toy-consensus, firewall) to complex (stoppable- 
paxos-epr, bosco-3t-safety). Some problems admit invariants that are purely uni- 
versal, and others use universal and existential quantifiers, with some in EPR. All 
our examples are safe transition systems with a known human-written invariant. 


6.2 Experimental Setup 


We compare our algorithm to the techniques Swiss [14], IC3PO [11,12], fol-ic3
[19], and PDR∀ [18]. We performed our experiments on a 56-thread machine with
64 GiB of RAM, with each experiment restricted to 16 hardware threads, 20 GiB of
RAM, and a 6 hour time limit.⁷ To account for noise caused by randomness in seed
selection, we ran each algorithm 5 times and report the number of successes and
the median time. PDR∀, IC3PO, and fol-ic3 are not designed to use parallelism,
while Swiss and our technique make use of parallelism. For IC3PO, we use the
better result from the two implementations [11] and [12], and give reported
results for those we could not replicate. For our technique, we ran the tool in
universal-only, EPR, or full FOL mode as appropriate. For k-pDNF, we use k = 1
for universal prefixes and k = 3 otherwise.


7 Specifically, a dual-socket Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz.


Table 1. Experimental results, giving both the median wall-clock run time (seconds)
and the number of successful trials, out of five. If there were fewer than 3 successful
trials, we report the slowest successful trial, indicated by (>). A dash (-) indicates all
trials failed or timed out after 6 hours (21600 seconds). A blank indicates no data.


Example                    EPR  Our      #  Swiss   #  IC3PO  #  fol-ic3  #  PDR∀   #

lockserv                   ✓    19       5  9573    4  5      5  7        5  6      5
toy-consensus-forall       ✓    4        5  22      5  4      5  11       5  4      5
ring-id                    ✓    7        5  192     5  81     5  28       5  20     5
sharded-kv                 ✓    8        5  17291   5  4      5  19       5  6      5
ticket                     ✓    23       5  -       0  -      0  240      5  22     5
learning-switch            ✓    76       5  1744    4  29     5  -        0  94     5
consensus-wo-decide        ✓    50       5  52      5  6      5  33       5  29     5
consensus-forall           ✓    1908     5  80      5  15     5  1125     5  104    5
cache                      ✓    2492     4  -       0  3906   5  -        0  2628   5
paxos-forall               ✓    885      5  -       0  -      0  -        0  555    5
flexible-paxos-forall      ✓    1961     5  -       0  1654   5  -        0  423    5
stoppable-paxos-forall     ✓    7779     5  -       0  -      0  -        0  -      0
fast-paxos-forall          ✓    -        0  -       0  -      0  -        0  20176  3
vertical-paxos-forall      ✓    -        0  -       0  -      0  -        0  -      0

firewall                   -    4        5  -       0  3      5  9        5
sharded-kv-no-lost-keys    ✓    4        5  9       5  4      5  5        5
toy-consensus-epr          ✓    4        5  10      5  4      5  49       5
ring-id-not-dead           -    19       5  -       0  -      0  221      3
consensus-epr              ✓    37       5  57      5  28     5  -        0
client-server-ae           ✓    4        5  11      5  4      5  442      5
client-server-db-ae        -    16       5  46      5  37     5  6639     4
hybrid-reliable-broadcast  -    178      5  -       0  -      0  937      5
paxos-epr                  ✓    920      5  14332   4  -      0  -        0
flexible-paxos-epr         ✓    418      5  4928    5  -      0  -        0
multi-paxos-epr            ✓    4272     4  -       0  -      0  -        0
stoppable-paxos-epr        ✓    >18297   2  -       0  -      0  -        0
fast-paxos-epr             ✓    9630     3  -       0  -      0  -        0
vertical-paxos-epr         ✓    -        0  -       0  -      0  -        0
block-cache-async          -    -        0  -       0  -      0  -        0
bosco-3t-safety            ✓    >11019†  1  -       0  -      0  -        0


† With EPR push down enabled.


6.3 Results and Discussion 


We present the results of our experiments in Table 1. In general, for examples
that converge with both prior approaches and our technique, we match or exceed
existing results, with significant performance gains for some problems such as
client-server-db-ae relative to the previous separation-based approach. Along
with other techniques, we solve paxos-epr and flexible-paxos-epr, which are the
simplest variants of Paxos in our benchmark, but nonetheless represent a
significant jump in complexity over the examples solved by the prior generation
of PDR/IC3 techniques. Paxos and its variants are notable for having invariants
with two quantifier alternations (∀∃∀) and a maximum quantifier depth of 6 or 7.
We uniquely solve multi-, fast-, and stoppable-paxos-epr, which add significant
complexity in the number of sorts, symbols, and quantifier depth required. Due
to variations in seeds and the non-determinism of parallelism, our technique was
only successful in some trials, but these results nevertheless demonstrate that
our technique is capable of solving these examples. Our algorithm is unable to
solve vertical-paxos-epr, as this example requires a 7-quantifier formula that
is very expensive for our IG solver.


Table 2. Ablation study. Columns are interpreted as in Table 1.


Example                    Our      #  No pDNF  #  No EPR  #  No Inc. SMT  #  No Gen.  #

lockserv                   19       5                         34           5  13
toy-consensus-forall       4        5                         5            5  4
ring-id                    7        5                         11           5  13
sharded-kv                 8        5                         11           5  7
ticket                     23       5                         42           5  21
learning-switch            76       5                         338          5  288
consensus-wo-decide        50       5                         50           5  51
consensus-forall           1908     5                         2154         5  558
cache                      2492     4                         >16826       2  13116
paxos-forall               885      5                         1071         5  10488
flexible-paxos-forall      1961     5                                      5  >4168
stoppable-paxos-forall     7779     5                                      5
fast-paxos-forall          -        0                         >16573       1  -
vertical-paxos-forall      -        0                         -            0  -

firewall                   4        5  4        5             4            5  4
sharded-kv-no-lost-keys    4        5  4        5  4       5  5            5  4
toy-consensus-epr          4        5  5        5  5       5  5            5  5
ring-id-not-dead           19       5  37       5             44           5  52
consensus-epr              37       5  126      5  724     5  45           5  233
client-server-ae           4        5  3        5  4       5  4            5  4
client-server-db-ae        16       5  13       5             20           5  10
hybrid-reliable-broadcast  178      5  98       5             173          5  629
paxos-epr                  920      5  10135    4  >2895   1  609          5  3201
flexible-paxos-epr         418      5  13742    3  -       0  775          5  799
multi-paxos-epr            4272     4  >15176   1  -       0  15854        3  7326
stoppable-paxos-epr        >18297   2  -        0  -       0  >20659       1  >11946
fast-paxos-epr             9630     3  -        0  -       0  8976         3  >20871
vertical-paxos-epr         -        0  -        0  -       0  -            0  -
block-cache-async          -        0  -        0             -            0  >20038
bosco-3t-safety            >11019   1  -        0  -       0  >8581        1  >16689

For universal-only examples, our algorithm is able to solve all but one of the 
examples⁸ solved by other techniques, and is able to solve one that others cannot.
In some cases (e.g. consensus-forall), our solution is slower than other approaches, 
but on the whole our algorithm is competitive in a domain it is not specialized for. 
In addition, we significantly outperform the existing separation-based algorithm 
(fol-ic3) by solving several difficult examples (cache, paxos-forall). 


6.4 Ablation Study 


Table 2 presents an ablation study investigating the effect of various features
of our technique. The first column of Table 2 repeats the full algorithm results,
and the remaining columns report the performance with various features disabled
individually. The most important individual contributions come from k-pDNF
matrices and EPR. Using a 5-clause CNF instead of a pDNF matrix (No pDNF) causes
many difficult examples to fail and some (e.g., flexible-paxos-epr) to take
significantly longer even when they do succeed.⁹ Similarly, using full FOL mode
instead of EPR (No EPR) leads to timeouts for all but the simplest Paxos
variants. Incremental SMT queries (No Inc. SMT) make the more difficult Paxos
variants, and the universal cache example, succeed much more reliably.
Multi-block generalization (No Gen.) makes many problems faster or more
reliable, but disabling it allows block-cache-async to succeed.


8 fast-paxos-forall, which is solved by our technique in the ablation study, albeit rarely.


Table 3. Parallel vs sequential comparison. Each of 5 trials ran with 3 or 48 hour
timeouts, respectively. The number of successes, and the average number of IG queries
in each trial (including failed ones) are given.


                      Successes      IG Queries
Example               Par.   Seq.    Par.   Seq.
paxos-epr             5      5       61     76
flexible-paxos-epr    5      5       64     72
multi-paxos-epr       3      1       67     84

To isolate the benefits of parallelism, we ran several examples in both parallel 
and serial mode with a proportionally larger timeout (Table 3). In both modes 
we use a single prefix category containing all prefixes, with the same static order
over prefixes.¹⁰ Beyond the wall-clock speedup, the parallel IG algorithm affects
the quality of the learned lemmas, that is, how well they generalize and avoid 
overfitting. To estimate the quality of generalization, we count the total number 
of IG queries performed by each trial and report the average over the five trials. In 
all examples, the parallel algorithm learns fewer lemmas overall, which suggests it 
generalizes better. We attribute this improved generalization to the implicit bias 
towards lemmas that are faster to discover. For the more complicated example 
(multi-paxos-epr), this difference has an impact on the success rate. 


7 Related Work 


Extensions of PDR/IC3. The PDR/IC3 [3,5] algorithm has been very influen- 
tial as an invariant inference technique, first for hardware (finite state) systems 
and later for software (infinite state). There are multiple extensions of PDR/IC3 
to infinite state systems using SMT theories [16,20]. [18] extended PDR/IC3 to 
universally quantified first-order formulas using the model-theoretic notion of 
diagrams. [13] applies PDR/IC3 to find universally quantified invariants over ar- 
rays and also to manage quantifier instantiation. Another extension of PDR/IC3 
for universally quantified invariants is [23], where a quantified invariant is gen- 
eralized from an invariant of a bounded, finite system. This technique of gener- 
alization from a bounded system has also been extended to quantifiers with
alternations [11]. Recently, [31] suggested combining synthesis and PDR/IC3, but
they focus on word-level hardware model checking and do not support quantifier
alternations. Most of these works focus on quantifier-free or universally
quantified invariants. In contrast, we address unique challenges that arise when
supporting lemmas with quantifier alternations.


9 With a single clause, there is no difference between CNF and k-pDNF, so results are
only given for existential problems.
10 To make the comparison cleaner, we also disabled multi-block generalization.

The original PDR/IC3 algorithm has also been extended with techniques that 
use different heuristic strategies to find more invariants by considering additional 
proof goals and collecting reachable states [15,17]. Our implementation benefits 
from some of these heuristics, but our contribution is largely orthogonal as our 
focus is on inductive generalization of quantified formulas. Generating lemmas 
from multiple states, similar to multi-block generalization, was explored in [21]. 

[24] suggests a way to parallelize PDR/IC3 by combining a portfolio approach 
with problem partitioning and lemma sharing. Our parallelism is more tightly 
coupled into PDR/IC3, as we parallelize the inductive generalization procedure. 


Quantified Separation. Quantified separation [19] was recently introduced as a 
way to find quantified invariants with quantifier alternations. While [19] intro- 
duced a way to combine separation and PDR/IC3, it has limited scalability 
and cannot find the invariants of complex protocols such as Paxos. Our work 
here is motivated by these scalability issues. In contrast to [19], our technique 
is able to find complex invariants by avoiding expensive but useless areas of 
the search space using a breadth-first strategy and a multi-dimensional induc- 
tive bias. While [19] searches for quantified lemmas in CNF, we introduce and 
use k-term pDNF. k-term pDNF can express the necessary lemmas of many 
distributed protocols more succinctly, resulting in better scalability. 


Synthesis-Based Approaches to Invariant Inference. Synthesis is a common ap- 
proach for automating invariant inference. ICE [10] is a framework for learning 
inductive invariants from positive, negative, and implication constraints. Our 
use of separation is similar, but it is integrated into PDR/IC3's inductive
generalization, so unlike ICE we find invariants incrementally.


Enumeration-Based Approaches. Another approach is to use enumerative search, 
for example [6], which only supports universal quantification. Enumerative search 
has been extended to quantifier alternations in [14], which is able to infer the 
invariants of complex protocols such as some Paxos variants. 


8 Conclusion 


We have presented an algorithm for quantified invariant inference that combines 
separation and inductive generalization. Our algorithm uses a breadth-first strat- 
egy to avoid regions of the search space that are expensive. We also explore a 
new syntactic form that is well-suited for lemmas with alternations. We show 
via a large scale experiment that our algorithm advances the state of the art 
in quantified invariant inference with alternations, and finds significantly more 
invariants on difficult problems than prior approaches. 
Abstract. The most scalable approaches to certifying neural network 
robustness depend on computing sound linear lower and upper bounds 
for the network’s activation functions. Current approaches are limited 
in that the linear bounds must be handcrafted by an expert, and can 
be sub-optimal, especially when the network’s architecture composes op- 
erations using, for example, multiplication such as in LSTMs and the 
recently popular Swish activation. The dependence on an expert pre- 
vents the application of robustness certification to developments in the 
state-of-the-art of activation functions, and furthermore the lack of tight- 
ness guarantees may give a false sense of insecurity about a particular 
model. To the best of our knowledge, we are the first to consider the 
problem of automatically synthesizing tight linear bounds for arbitrary 
n-dimensional activation functions. We propose the first fully automated 
method that achieves tight linear bounds while only leveraging the math- 
ematical definition of the activation function itself. Our method leverages 
an efficient heuristic technique to synthesize bounds that are tight and 
usually sound, and then verifies the soundness (and adjusts the bounds 
if necessary) using the highly optimized branch-and-bound SMT solver, 
DREAL. Even though our method depends on an SMT solver, we show 
that the runtime is reasonable in practice, and, compared with state of 
the art, our method often achieves 2-5X tighter final output bounds and 
more than quadruple certified robustness. 


1 Introduction 


Prior work has shown that neural networks are vulnerable to various types of 
(adversarial) perturbations, such as small ℓ-norm bounded perturbations [39],
geometric transformations [13, 22], and word substitutions [2]. Such perturbations
can often cause a misclassification for any given input, which may have serious 
consequences, especially in safety critical systems. Certifying robustness to these 
perturbations has become an important problem as it can show the network does 
not exhibit these misclassifications, and furthermore previous work has shown 
that a given input feature’s certified robustness can be a useful indicator to 
determine the feature’s importance in the network’s decision [34, 25]. 
Indeed, many approaches have been proposed for certifying the robustness 
of inputs to these perturbations. Previous work typically leverages two types of 
techniques: (1) fast and scalable, but approximate techniques [36, 15, 45, 34, 25], 
and (2) expensive but exact techniques that leverage some type of constraint 
solver [23, 24, 40]. Several works have also combined the two [37, 35, 43, 42]. 
The most successful approaches, in terms of scalability in practice, are built on 
top of the approximate techniques, which all depend on computing linear bounds 
for the non-linear activation functions. 

However, a key limitation is that the linear bounds must be handcrafted and 
proven sound by experts. Not only is this process difficult, but also ensuring the 
tightness of the crafted bounds presents an additional challenge. Unfortunately, 
prior work has only crafted bounds for the most common activation functions 
and architectures, namely ReLU [43], sigmoid, tanh [36, 48, 46], the exp func- 
tion [34], and some 2-dimensional activations found in LSTM networks [25]. As a 
result, existing tools for neural network verification cannot handle a large num- 
ber of activation functions that are frequently used in practice. Examples include 
the GELU function [18], which is currently the activation function used in
OpenAI's GPT [31], and the Swish function, which has been shown to outperform the
standard ReLU function in some applications [32] and, in particular, can reduce
over-fitting in adversarial training [38]. In addition, these recently introduced
activation functions are often significantly more complex than previous activation
functions, e.g., we have $\mathrm{gelu}(x) = 0.5x\,(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715x^3)])$.

In this work, we study the problem of efficiently and automatically syn- 
thesizing sound and tight linear bounds for any arbitrary activation function. 
By arbitrary activation function, we mean any (non-linear) computable function
$z = \sigma(x_1, \ldots, x_d)$ used inside a neural network with $d$ input variables. By
sound we mean that, given an interval bound on each variable $x_1 \in [l_1, u_1], x_2 \in [l_2, u_2], \ldots, x_d \in [l_d, u_d]$,
the problem is to efficiently compute lower bound coefficients
$c^l_1, c^l_2, \ldots, c^l_{d+1}$ and upper bound coefficients $c^u_1, c^u_2, \ldots, c^u_{d+1}$ such that
the following holds:

$$
\begin{aligned}
&\forall x_1 \in [l_1, u_1],\ x_2 \in [l_2, u_2],\ \ldots,\ x_d \in [l_d, u_d]: \\
&c^l_1 x_1 + c^l_2 x_2 + \cdots + c^l_{d+1} \le \sigma(x_1, \ldots, x_d) \le c^u_1 x_1 + c^u_2 x_2 + \cdots + c^u_{d+1}
\end{aligned}
\tag{1}
$$

By automatically, we mean that the above is done using only the definition of the 
activation function itself. Finally, by tight, we mean that some formal measure, 
such as the volume above/below the linear bound, is minimized/maximized. 
We have developed a new method, named LINSYN, that can automatically 
synthesize tight linear bounds for any arbitrary non-linear activation function 
$\sigma(\cdot)$. We illustrate the flow of our method on the left-hand side of Fig. 1. As 
shown, LINSYN takes two inputs: a definition of the activation function, and an 
interval for each of its inputs. LINSYN outputs linear coefficients such that Equa- 
tion 1 holds. Internally, LINSYN uses sampling and an LP (linear programming) 
solver to synthesize candidate lower and upper bound coefficients. Next, it uses 
an efficient local minimizer to compute a good estimate of the offset needed to 
ensure soundness of the linear bounds. Since the candidate bounding functions 
constructed in this manner may still be unsound, finally, we use a highly op- 
timized branch-and-bound nonlinear SMT solver, named DREAL [14], to verify 
Fig. 1. The overall flow of LINSYN. 


the soundness of the linear bounds. Even though our new method involves the 
use of solvers and optimizers, the entire process typically takes less than 1/100th 
of a second per pair of bounds. 

Fig. 1 also illustrates how LINSYN fits in with existing neural network verification
frameworks, such as ERAN [1] and AUTOLIRPA [47]. These tools take
as input a neural network and a region of the neural network's input space, and
compute an over-approximation of the neural network's outputs. Internally, these
frameworks have modules that compute linear bounds for specific activation
functions. LINSYN is a one-size-fits-all drop-in replacement for these modules,
invoked at runtime whenever a linear bound of a non-linear activation
function is needed.

Our method differs from these existing frameworks, in which a user (usually an
expert in neural network verification) must provide hand-crafted, sound linear
bounds for the activation functions of a neural network. To date, they
only support the previously mentioned activation functions. We note however 
that the recent framework AUTOLIRPA supports binary operations (namely 
addition, subtraction, multiplication, and division) as “activation functions”. 
Thus, while it is not explicitly designed to handle complex activations, it can
do so by decomposing, e.g., gelu(x) into operations that it supports, and
then combining them. In contrast, LINSYN bounds the activation function as a 
whole, which we will show produces much tighter linear bounds. 

We have implemented our method in a tool called LINSYN, and evaluated it 
on benchmarks in computer vision and natural language processing (NLP). Our 
evaluation shows that we can obtain final output bounds often 2-5X tighter 
than the most general tool [47], thus allowing us to drastically increase certi- 
fied robustness. In addition, our tool achieves accuracy equal to or better than 
the handcrafted LSTM bounds of POPQORN [25], which is currently the most 
accurate tool for analyzing LSTM-based NLP models, at a comparable runtime. 

To summarize, this paper makes the following contributions: 


— We propose the first method for automatically synthesizing tight linear 
bounds for arbitrary activation functions. 
— We implement our approach in a tool called LINSYN, and integrate it as a 
bounding module into the AUTOLIRPA framework, thus producing a neural 
network verification tool that can theoretically compute tight linear bounds 
for any arbitrary activation function. 

— We extensively evaluate our approach and show it outperforms state-of-the- 
art tools in terms of accuracy and certified robustness by a large margin. 


The rest of this paper is organized as follows. First, we provide the technical 
background in Section 2. Then, we present our method for synthesizing the linear 
bounds in Section 3 and our method for verifying the linear bounds in Section 4. 
Next, we present the experimental results in Section 5. We review the related 
work in Section 6 and, finally, give our conclusions in Section 7. 


2 Preliminaries 


In this section, we define the neural network verification problem, and illustrate 
both how state-of-the-art verification techniques work, and their limitations. 


2.1 Neural Networks 


Following conventional notation, we refer to matrices with capital bold letters 
(e.g. $\mathbf{W} \in \mathbb{R}^{n \times m}$), vectors as lower case bold letters (e.g. $\mathbf{x} \in \mathbb{R}^n$), and scalars
or variables with lower case letters (e.g. $x \in \mathbb{R}$). Slightly deviating from the
convention, we refer to a set of elements with capital letters (e.g. $X \subseteq \mathbb{R}^n$).

We consider two types of networks in our work: feed-forward and recurrent. 
We consider a feed-forward neural network to be a (highly) non-linear function 
$f : X \to Y$, where $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$. We focus on neural network classifiers.
For an input $\mathbf{x} \in X$, each element in the output $f(\mathbf{x})$ represents a score for a
particular class, and the class associated with the largest element is the chosen
class. For example, in image classification, $X$ would be the set of all images, each
element of an input $\mathbf{x} \in X$ represents a pixel's value, and each element in $Y$ is
associated with a particular object that the image might contain. 

In feed-forward neural networks the output $f(\mathbf{x})$ is computed by performing
a series of affine transformations, i.e., multiplying by a weight matrix, followed
by application of an activation function $\sigma(\cdot)$. Formally, a neural network with $l$
layers has $l$ two-dimensional weight matrices and $l$ one-dimensional bias vectors
$\mathbf{W}_i, \mathbf{b}_i$, where $i \in 1..l$, and thus we have
$f(\mathbf{x}) = \mathbf{W}_l \cdot \sigma(\mathbf{W}_{l-1} \cdots \sigma(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1) \cdots + \mathbf{b}_{l-1}) + \mathbf{b}_l$,
where $\sigma(\cdot)$ is the activation function applied element-wise
to the input vector. The default choice of activation is typically the sigmoid
$\sigma(x) = 1/(1 + e^{-x})$, tanh, or ReLU function $\sigma(x) = \max(0, x)$; however, recent
work [18, 32, 31] has shown that functions such as $\mathrm{gelu}(x)$ and
$\mathrm{swish}(x) = x \times \mathrm{sigmoid}(x)$ can have better performance and desirable theoretical properties.
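
As a minimal sketch of the definitions above (not taken from the paper's implementation), the following plain-Python snippet defines the activations mentioned and evaluates a small feed-forward network layer by layer. The helper names `affine` and `forward`, the list-of-lists weight layout, and the use of the tanh-based GELU approximation are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def swish(x):
    return x * sigmoid(x)

def gelu(x):
    # tanh-based approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def affine(W, b, x):
    # Computes W * x + b, with W given as a list of rows (hypothetical layout).
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(layers, x, act):
    # layers is a list of (W, b) pairs; the activation is applied element-wise
    # after every layer except the last one.
    *hidden, (W_last, b_last) = layers
    for W, b in hidden:
        x = [act(v) for v in affine(W, b, x)]
    return affine(W_last, b_last, x)
```

For instance, `forward([([[1.0, -1.0]], [0.0]), ([[2.0]], [1.0])], [1.0, 3.0], relu)` evaluates a two-layer network with one hidden neuron; swapping `relu` for `swish` or `gelu` changes only the `act` argument.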

Unlike feed-forward neural networks, recurrent neural networks receive a sequence
of inputs $[\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(t)}]$, and the final output of $f$ on $\mathbf{x}^{(t)}$ is used to perform
the classification of the whole sequence. Recurrent neural networks are stateful,
meaning they maintain a state vector that contains information about inputs
previously given to $f$, which also gets updated on each call to $f$. In particular,
we focus on long short-term memory (LSTM) networks, which have seen wide
adoption in natural language processing (NLP) tasks due to their sequential 
nature. For LSTMs trained for NLP tasks, the network receives a sequence of 
word embeddings. A word embedding is an n-dimensional vector that is associ- 
ated with a particular word in a (natural) language. The distance between word 
embeddings carries semantic significance: two word embeddings that are close
to each other in $\mathbb{R}^n$ typically have similar meanings or carry a semantic relatedness
(e.g. dog and cat or king and queen), whereas unrelated words typically
are farther apart. 

LSTM networks further differ from feed-forward networks in that their internal
activation functions are two-dimensional. Specifically, we have the following
two activation patterns: $\sigma_1(x) \times \sigma_2(y)$ and $x \times \sigma_1(y)$. The default choices are
$\sigma_1(x) = \mathrm{sigmoid}(x)$, and $\sigma_2(x) = \tanh(x)$. However, we can swap $\sigma_1$ with any
function with output range bounded by $[0, 1]$, and swap $\sigma_2$ with any function
with output range bounded by $[-1, 1]$. Indeed, prior work [16] has shown that
$\sigma_1(x) = 1 - e^{-e^{x}}$ can achieve better results in some applications.


2.2 Neural Network Verification 


A large number of problems in neural network verification can be phrased as the
following: given an input region $\bar{X} \subseteq X$, compute an over-approximation $\bar{Y}$ such
that $\{f(\mathbf{x}) \mid \mathbf{x} \in \bar{X}\} \subseteq \bar{Y} \subseteq Y$. Typically $\bar{X}$ and $\bar{Y}$ are hyper-boxes represented
by an interval for each of their elements. A common problem is to prove that
a point $\mathbf{x} \in X$ is robust, meaning that small perturbations will not cause an
incorrect classification. In this case, $\bar{X}$ is the set of all perturbed versions of $\mathbf{x}$,
and to prove robustness, we check that the element of the correct class in $\bar{Y}$ has
a lower bound that is greater than the upper bound of all other elements.
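
The robustness check described above can be sketched as follows. This is an illustrative helper only: the function name `is_certified_robust` and the representation of the output hyper-box as two lists of per-class lower and upper bounds are assumptions, not part of any tool discussed here.

```python
def is_certified_robust(lower, upper, correct):
    """Return True if the correct class's lower bound exceeds the upper
    bound of every other class (hypothetical helper for illustration)."""
    return all(lower[correct] > upper[k]
               for k in range(len(upper)) if k != correct)
```

With output bounds `lower = [0.3, -1.0]` and `upper = [0.9, 0.2]`, class 0 is certified because 0.3 > 0.2. If the two intervals overlap, the check fails even though the network may still be robust, since the bounds are only an over-approximation.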

We illustrate a simple verification problem on the neural network shown in
Fig. 2. The network has two inputs, $x_1, x_2$, and two outputs $x_7, x_8$ which represent
scores for two different classes. We refer to the remaining hidden neurons
as $x_i$, $i \in 3..6$. Following prior work [36], we break the affine transformation and
application of the activation function into two separate neurons, and the neurons
are assumed to be ordered such that, if $x_i$ is in a layer before $x_j$, then $i < j$.
For simplicity, in this motivating example, we let $\sigma(x) = \max(0, x)$ (the ReLU
function). We are interested in proving that the region $x_1 \in [-1, 1], x_2 \in [-1, 1]$
always maps to the first class, or in other words, we want to show that the lower
bound of $x_7$ is greater than the upper bound of $x_8$.


2.3 Existing Methods 


The most scalable approaches (to date) for neural network verification are based 
on linear bounding and back-substitution [47], also referred to as abstract inter- 
pretation in the polyhedral abstract domain [36] or symbolic interval analysis [43] 
in prior work. 

For each neuron $x_j$ in the network, these approaches compute a concrete
lower and upper bound $l_j, u_j$, and a linear lower and upper bound in terms of
the previous layer's neurons. The linear bounds (regardless of the choice of $\sigma(\cdot)$)
have the following form: $\sum_i c^l_i x_i + c^l \le x_j \le \sum_i c^u_i x_i + c^u$. The bounds
are computed in a forward, layer-by-layer fashion which guarantees that any
referenced neurons will already have a bound computed when back-substitution
is performed.

Fig. 2. Example of neural network verification.

Fig. 3. Linear bounds for ReLU activation.

To obtain the concrete bounds $l_j, u_j$ for a neuron $x_j$, the bounds of any
non-input neurons are recursively substituted into the linear bounds of $x_j$ until only
input nodes $x_1, \ldots, x_n$ remain. Finally, the concrete input intervals are substituted
into the bound to obtain $l_j, u_j$.


Example We illustrate on the two-layer network in Fig. 2 for the previously
defined property. We trivially have $l_1 = l_2 = -1$, $u_1 = u_2 = 1$, $-1 \le x_1 \le 1$, and
$-1 \le x_2 \le 1$. We then compute linear bounds for $x_3, x_4$ in terms of the previous
layer's neurons $x_1, x_2$. We multiply $x_1, x_2$ by the edge weights, obtaining $-x_1 + x_2$
as the lower and upper bound for both $x_3$ and $x_4$. Since this bound is already
in terms of the input variables, we substitute the concrete bounds into this
equation and obtain $l_3 = l_4 = -2$ and $u_3 = u_4 = 2$.
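
The final concretization step, in which input intervals are substituted into a linear bound, can be sketched as below. The function name `concretize` and its argument layout are hypothetical. Each coefficient picks the interval endpoint that minimizes (for the lower bound) or maximizes (for the upper bound) its term.

```python
def concretize(coeffs, const, lows, highs):
    """Concrete (lower, upper) range of sum(coeffs[i] * x[i]) + const
    when each x[i] ranges over [lows[i], highs[i]] (illustrative helper)."""
    lo = hi = const
    for c, l, u in zip(coeffs, lows, highs):
        if c >= 0:
            lo += c * l
            hi += c * u
        else:
            lo += c * u
            hi += c * l
    return lo, hi
```

Calling `concretize([-1.0, 1.0], 0.0, [-1.0, -1.0], [1.0, 1.0])` evaluates the bound $-x_1 + x_2$ over the input box of the example.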

Next, we need to compute the linear bounds for $x_5 = \sigma(x_3)$ and $x_6 = \sigma(x_4)$
after applying the activation function. Solving this challenge has been the focus
of many prior works. There are two requirements. First, they need to be sound.
For example, for $x_5$ we need to find coefficients $c^l_1, c^l_2, c^u_1, c^u_2$ such that
$c^l_1 x_3 + c^l_2 \le \sigma(x_3) \le c^u_1 x_3 + c^u_2$ for all $x_3 \in [l_3, u_3]$, and similarly for $x_6$. Second, we want
them to be tight. Generally, this means that volume below the upper bound is
minimized, and volume below the lower bound is maximized.

As an example, prior work [36, 48] proposed the following sound and tight
bound for $\sigma(x) = \max(0, x)$:

$$\forall x_i \in [l_i, u_i]: \quad 0 \le \sigma(x_i) \le \frac{u_i}{u_i - l_i}\, x_i + \frac{-l_i u_i}{u_i - l_i}$$


We illustrate the bound for $x_5$ in Fig. 3. After computing this bound, we recursively
substitute variables in the bounds of $x_5$ with the appropriate bound, and
compute $l_5, u_5$. The process then repeats for $x_6$, followed by $x_7$ and $x_8$. We then
check $l_7 > u_8$ to verify the property, which fails in this case.
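
For concreteness, a hand-crafted ReLU bound of this kind can be written as a short function. This is a sketch under assumptions: the name `relu_linear_bounds` is hypothetical, the lower bound is fixed to the constant 0 in the mixed-sign case (other sound choices exist), and the upper bound is the chord through $(l, 0)$ and $(u, u)$.

```python
def relu_linear_bounds(l, u):
    """Return ((a_lo, b_lo), (a_up, b_up)) such that
    a_lo*x + b_lo <= max(0, x) <= a_up*x + b_up for all x in [l, u]."""
    if l >= 0:            # ReLU is the identity on the whole interval
        return (1.0, 0.0), (1.0, 0.0)
    if u <= 0:            # ReLU is constantly zero on the whole interval
        return (0.0, 0.0), (0.0, 0.0)
    slope = u / (u - l)   # chord from (l, 0) to (u, u)
    return (0.0, 0.0), (slope, -l * slope)
```

For the interval $[-2, 2]$ this gives the upper bound $0.5x + 1$ and the lower bound $0$.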
2.4 Limitations of Existing Methods 


Current approaches only support a limited number of activation functions, and 
designing linear bounds for new activation functions often requires a significant 
amount of effort even for a domain expert. For example, handcrafted sound 
and tight linear bounds for activation functions such as ReLU, sigmoid, and 
tanh [36, 45, 48, 46, 44, 43], convolution layers and pooling operations [6], the 
two-dimensional activations found in LSTMs [25, 33], and those in transformer 
networks [34] are worthy of publication. Furthermore, even bounds that are 
hand-crafted by experts are not always tight. For example, a recent work [46] 
was able to nearly triple the precision of previous state-of-the-art sigmoid and 
tanh linear bounds simply by improving tightness. 

To the best of our knowledge, AUTOLIRPA [47] is the only tool that has
the ability to handle more complex activation functions, though it was not originally
designed for this. It can do so by decomposing them into simpler operations,
and then composing the bounds together. We illustrate with
$\mathrm{swish}(x) = x \times \mathrm{sigmoid}(x)$, where $x \in [-1.5, 5.5]$. AUTOLIRPA would first bound $\mathrm{sigmoid}(x)$
over the region $[-1.5, 5.5]$, resulting in the bound
$0.11x + 0.35 \le \mathrm{sigmoid}(x) \le 0.22x + 0.51$. For the left-hand operand, $x$, we trivially have $x \le x \le x$.
AUTOLIRPA would then bound a multiplication $y \times z$, where in this case $y = x$
and $z = \mathrm{sigmoid}(x)$, resulting in the final bound
$-0.15x - 0.495 \le x \times \mathrm{sigmoid}(x) \le 0.825x + 0.96$. We illustrate this bound in Fig. 4, and we provide bounds
computed by LINSYN as a comparison point. LINSYN provides a slightly better upper
bound, and a significantly better lower bound. The looseness arises because,
when AUTOLIRPA bounds $\mathrm{sigmoid}(x)$, it necessarily accumulates some
approximation error, since it is approximating the behavior of a non-linear
function with linear bounds. The approximation error effectively “loses some
information” about its input variable $x$. Then, when bounding the multiplication
operation, it has partially lost the information that $y$ and $z$ are related
(i.e. they are both derived from $x$). In contrast, LINSYN overcomes this issue by
considering $\mathrm{swish}(x)$ as a whole. We explain how in the following sections.


3 Synthesizing the Candidate Linear Bounds 


In this section, we describe our method for synthesizing candidate, possibly 
unsound linear bounds. 


3.1 Problem Statement and Challenges 


We assume we are given a $d$-dimensional activation function $z = \sigma(x_1, \ldots, x_d)$,
and an input interval $x_i \in [l_i, u_i]$ for each $i \in \{1..d\}$. Our goal is to synthesize
linear coefficients $c^l_i, c^u_i$, where $i \in \{1..d+1\}$, that are sound, meaning that the
following condition holds:

$$
\begin{aligned}
&\forall x_1 \in [l_1, u_1],\ x_2 \in [l_2, u_2],\ \ldots,\ x_d \in [l_d, u_d]: \\
&c^l_1 x_1 + c^l_2 x_2 + \cdots + c^l_{d+1} \le \sigma(x_1, x_2, \ldots, x_d) \le c^u_1 x_1 + c^u_2 x_2 + \cdots + c^u_{d+1}
\end{aligned}
\tag{2}
$$

Fig. 4. Bounds computed by LINSYN and AUTOLIRPA for swish(x), x ∈ [−1.5, 5.5].

Fig. 5. Candidate plane synthesis.


In addition, we want to ensure that the bounds are tight. The ideal definition 
of tightness would choose linear bounds that maximize the precision of the overall 
analysis, for example minimizing the width of the output neuron’s intervals. 
Unfortunately, such a measure would involve all of the neurons of the network, 
and so is impractical to compute. Instead, the common practice is to settle for 
tightness that’s local to the specific neuron we are bounding. 

Informally, we say a bound is tight if the volume below the upper bound is
minimized, and volume below the lower bound is maximized. Prior work [48,
36, 25] has found this to be a good heuristic¹. Formally, volume is defined as
the following integral:
$\int_{l_1}^{u_1} \cdots \int_{l_d}^{u_d} \sum_{i=1}^{d} c^u_i x_i + c^u_{d+1} \; dx_1 \ldots dx_d$ which, for the
upper bound, should be minimized subject to Equation 2. This integral has the
following closed-form solution:

$$\sum_{i=1}^{d} \left[ \frac{1}{2}\, c^u_i \times \prod_{j=1}^{d} \left( u_j^{\,1 + 1_{i=j}} - l_j^{\,1 + 1_{i=j}} \right) \right] + c^u_{d+1} \times \prod_{i=1}^{d} (u_i - l_i) \tag{3}$$

where $1_{i=j}$ is the (pseudo Boolean) indicator function that returns 1 when its
predicate is true and 0 otherwise. We omit the proof, but note that the above expression can be
derived inductively on $d$. Also note that, since each $l_i, u_i$ are concrete, the above
expression is linear in terms of the coefficients, which will be advantageous in
our approach below.
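
Because the volume is linear in the coefficients, it can be represented as one weight per coefficient. The sketch below computes those weights for a box domain; the function name `volume_objective_weights` is hypothetical, and the formula used is simply the integral of $\sum_i c_i x_i + c_{d+1}$ over the box, evaluated term by term.

```python
def volume_objective_weights(lows, highs):
    """Weights w such that the volume under c_1*x_1 + ... + c_d*x_d + c_{d+1}
    over the box [lows, highs] equals sum(w[i] * c[i]) (illustrative helper)."""
    d = len(lows)
    widths = [u - l for l, u in zip(lows, highs)]
    weights = []
    for i in range(d):
        # Integral of x_i over its own interval ...
        w = (highs[i] ** 2 - lows[i] ** 2) / 2.0
        # ... times the widths of all other intervals.
        for j in range(d):
            if j != i:
                w *= widths[j]
        weights.append(w)
    # The constant coefficient is multiplied by the volume of the box.
    box_volume = 1.0
    for width in widths:
        box_volume *= width
    weights.append(box_volume)
    return weights
```

For a two-dimensional box such as `volume_objective_weights([0.0, 0.0], [1.0, 2.0])`, the three returned weights multiply $c_1$, $c_2$, and the constant $c_3$, and can be passed directly as the objective vector of an LP solver.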

While recent approaches in solving non-linear optimization problems [26, 8] 
could directly minimize Equation 3 subject to Equation 2 in one step, we find 
the runtime to be very slow. Instead, we adopt a two-step approach that first 
uses efficient procedures for computing candidate coefficients that are almost 
sound (explained in this section), and second, only calls an SMT solver when 
necessary to verify Equation 2 (explained in the next section). We illustrate the 
approach on a concrete example. 


1 We also experimented with minimizing the volume between the linear bound and 
the activation function, which gave almost identical results. 
3.2 Synthesizing Candidate Bounds 


The first step in our approach computes candidate coefficients for the linear 
bound. In this step we focus on satisfying the tightness requirement, while mak- 
ing a best effort for soundness. We draw inspiration from prior work [33, 3] 
that leverages sampling to estimate the curvature of a particular function, and 
then uses a linear programming (LP) solver to compute a plane that is sound. 
However, unlike prior work which targeted a fixed function, we target arbitrary 
(activation) functions, and thus these are special cases of our approach. 

The constraints of the LP are determined by a set of sample points $S \subset \mathbb{R}^d$.
For the upper bound, we minimize Equation 3, subject to the constraint that
the linear bound is above $\sigma(\cdot)$ at the points in $S$. Using $s_i$ to refer to the $i$-th
element of the vector $\mathbf{s} \in S$, the linear program we solve is:

$$\text{minimize Equation (3) subject to } \bigwedge_{\mathbf{s} \in S} c^u_1 s_1 + c^u_2 s_2 + \cdots + c^u_{d+1} \ge \sigma(\mathbf{s}) \tag{4}$$

We generate S by sampling uniformly-spaced points over the input intervals. 


Example We demonstrate our approach on the running example illustrated in
Fig. 5. For the example, let $\sigma(x_1) = \frac{1}{1 + e^{-x_1}}$ (the sigmoid function, shown as the
blue curve), where $x_1 \in [-1, 3.5]$. We focus only on the upper bound, but the
lower bound is computed analogously.

Plugging the variables into Equation 3, the objective of the LP that we
minimize is: $\int_{-1}^{3.5} c^u_1 x_1 + c^u_2 \; dx_1 = 5.625 c^u_1 + 4.5 c^u_2$, which is shown as the shaded
region in Fig. 5.

We sample the points $S = \{-1, 0.25, 1.5, 2.75\}$, resulting in the following four
constraints: $-c^u_1 + c^u_2 \ge \sigma(-1) \wedge 0.25 c^u_1 + c^u_2 \ge \sigma(0.25) \wedge 1.5 c^u_1 + c^u_2 \ge \sigma(1.5) \wedge
2.75 c^u_1 + c^u_2 \ge \sigma(2.75)$. Solving the LP results in $c^u_1 = 0.104$, $c^u_2 = 0.649$,
which is illustrated by the green line in Fig. 5.
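
To make the candidate-synthesis step concrete, here is a minimal one-dimensional sketch in plain Python. It is not the paper's implementation: instead of calling an LP solver, it exploits the fact that a two-variable LP attains its optimum at a vertex and simply enumerates the lines through pairs of sample points. The names `uniform_samples` and `candidate_upper_bound`, the sample count, and the tolerance are all assumptions, and the usage example deliberately uses its own hypothetical setting (tanh on $[0, 2]$).

```python
import math
from itertools import combinations

def uniform_samples(l, u, n):
    # n evenly spaced points including both endpoints (assumed sampling scheme)
    return [l + (u - l) * k / (n - 1) for k in range(n)]

def candidate_upper_bound(sigma, l, u, n=4, tol=1e-12):
    """Return (c1, c2) minimizing the area under c1*x + c2 on [l, u]
    subject to c1*s + c2 >= sigma(s) at every sample point s.
    Brute-force vertex enumeration stands in for an LP solver."""
    xs = uniform_samples(l, u, n)
    ys = [sigma(x) for x in xs]
    w1 = (u * u - l * l) / 2.0   # objective weight of c1
    w2 = u - l                   # objective weight of c2
    best = None
    for (xa, ya), (xb, yb) in combinations(zip(xs, ys), 2):
        c1 = (yb - ya) / (xb - xa)       # line through the two samples
        c2 = ya - c1 * xa
        feasible = all(c1 * x + c2 >= y - tol for x, y in zip(xs, ys))
        if feasible:
            cost = w1 * c1 + w2 * c2
            if best is None or cost < best[0]:
                best = (cost, c1, c2)
    return best[1], best[2]

c1, c2 = candidate_upper_bound(math.tanh, 0.0, 2.0)
```

The returned line is guaranteed to lie above the function only at the sample points, so it is a candidate bound that still has to be checked, and possibly shifted, before it can be used.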


4 Making the Bound Sound 


In this section, we present our method for obtaining soundness because the 
candidate bounds synthesized in the previous section may not be sound. Here, 
we focus only on making the upper bound sound, but note the procedure for the 
lower bound is similar. 


4.1 Problem Statement and Challenges 


We are given the activation function $\sigma(\cdot)$, the input intervals $x_i \in [l_i, u_i]$, and the
candidate coefficients $c^u_1, c^u_2, \ldots, c^u_{d+1}$. The goal is to compute an upward shift,
if needed, to make the upper bound sound. First, we define the violation of the
upper bound as:

$$v(x_1, x_2, \ldots, x_d) := c^u_1 x_1 + c^u_2 x_2 + \cdots + c^u_{d+1} - \sigma(x_1, x_2, \ldots, x_d) \tag{5}$$
A negative value indicates the upper bound is not sound. We then need to
compute a lower bound on $v(\cdot)$, which we term $v_l$. Then the equation we pass to
the verifier is:

$$
\begin{aligned}
&\forall x_1 \in [l_1, u_1],\ x_2 \in [l_2, u_2],\ \ldots,\ x_d \in [l_d, u_d]: \\
&v(x_1, x_2, \ldots, x_d) + (-v_l) \ge 0
\end{aligned}
\tag{6}
$$

Expanding $v(\cdot)$ with its definition in the above equation results in the soundness
definition of Equation 2. Thus, if the verifier proves Equation 6, then shifting the
upper bound upward by $-v_l$ ensures its soundness. For our running example,
the quantity $v_l$ is shown by the red line in Fig. 5.
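
The role of the violation and the shift can be illustrated with a rough one-dimensional sketch. Here a dense grid stands in for a proper minimizer, so the result is only an estimate of $v_l$ and proves nothing by itself; the function names and the grid size are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def estimate_violation_min(c1, c2, sigma, l, u, n=1001):
    """Estimate v_l = min over [l, u] of v(x) = c1*x + c2 - sigma(x)
    by evaluating v on a uniform grid (an estimate, not a sound bound)."""
    xs = [l + (u - l) * k / (n - 1) for k in range(n)]
    return min(c1 * x + c2 - sigma(x) for x in xs)

def shift_upper_bound(c1, c2, v_l):
    # Shifting the line by -v_l makes the estimated minimum violation zero.
    return c1, c2 - v_l

# Hypothetical candidate: the tangent to the sigmoid at 0, which is not a
# sound upper bound on [-1, 3.5] because the sigmoid lies above it for x < 0.
v_l = estimate_violation_min(0.25, 0.5, sigmoid, -1.0, 3.5)
c1_shifted, c2_shifted = shift_upper_bound(0.25, 0.5, v_l)
```

A negative `v_l` moves the line upward and a positive one moves it downward; in both cases the shifted line still has to be handed to a verifier, since the grid may miss the true minimum.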

This problem is non-trivial because finding a solution for $v_l$ requires a search
for a sound global minimum/maximum of a function involving $\sigma(\cdot)$, which may
be highly non-linear. State-of-the-art SMT solvers such as Z3 do not support
all non-linear operations, and furthermore, since we assume arbitrary $\sigma(\cdot)$, the
problem may even be (computationally) undecidable.


4.2 Verifying the Bound 


We first assume we have a candidate (possibly unsound) $v_l$, and explain our
verification method. To ensure decidability and tractability, we leverage the
$\delta$-decision procedure implemented by DREAL [14]. To the best of our knowledge,
this is the only framework that is decidable for all computable functions.

In this context, instead of verifying Equation 6 directly, the formula is first negated,
thus changing it into an existentially quantified one, and then a
$\delta$-relaxation is applied. Formally, the formula DREAL attempts to solve is:

$$
\begin{aligned}
&\exists x_1 \in [l_1, u_1],\ x_2 \in [l_2, u_2],\ \ldots,\ x_d \in [l_d, u_d]: \\
&v(x_1, x_2, \ldots, x_d) + (-v_l) \le \delta
\end{aligned}
\tag{7}
$$

where $\delta$ is a small constant (e.g. $10^{-5}$), which we explain in a moment. The
above is formulated such that Equation 6 holds if (but not only if) there does
not exist a solution to Equation 7.

Internally, DREAL performs interval constraint propagation (ICP) on the left-hand
side of Equation 7 over the intervals defined by each $[l_i, u_i]$ to compute a
lower bound, and compares this lower bound with $\delta$. If the lower bound is greater
than $\delta$, then no solution exists (i.e., Equation 7 is unsatisfiable, and we have
proven the original Equation 6 holds). Otherwise a solution may exist. In this
case, DREAL iteratively partitions the input space defined by the $[l_i, u_i]$ and
repeats this process on each partition separately.

DREAL stops partitioning either when it proves all partitions do not have
solutions, or when a partition whose intervals all have width less than some $\epsilon$ is
found. Here, $\epsilon$ is proportional to $\delta$ (i.e., smaller $\delta$ means smaller $\epsilon$). In the latter
case, DREAL returns this partition as a “solution”.
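The partition-and-prune loop can be sketched in a few lines. This is a toy stand-in written under stated assumptions, not DREAL's implementation: it handles only the hypothetical one-dimensional function f(x) = x^2 - x + c, uses its natural interval extension without outward rounding, and prunes a box when the lower end of the enclosure is at least `delta` (so that f < delta has no solution there). The names `enclose` and `find_delta_solution` are invented for this illustration.

```python
def sq_interval(a, b):
    """Interval extension of x**2 on [a, b]."""
    lo = 0.0 if a <= 0.0 <= b else min(a * a, b * b)
    return lo, max(a * a, b * b)

def enclose(a, b, c):
    """Natural interval extension of f(x) = x^2 - x + c on [a, b]."""
    s_lo, s_hi = sq_interval(a, b)
    return s_lo - b + c, s_hi - a + c

def find_delta_solution(c, lo, hi, delta, eps):
    """Search for a box that may contain x with f(x) < delta.

    Returns None when every box is pruned (the relaxed formula is
    unsatisfiable), or a box narrower than eps as a candidate "solution".
    """
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        f_lo, _ = enclose(a, b, c)
        if f_lo >= delta:
            continue              # f >= delta on the whole box: prune it
        if b - a < eps:
            return (a, b)         # too small to split further: report it
        m = 0.5 * (a + b)
        stack.append((m, b))      # bisect and examine both halves
        stack.append((a, m))
    return None

# f has minimum 0.05 on [0, 1], so "f < delta" is refuted for delta = 1e-3.
proved = find_delta_solution(0.3, 0.0, 1.0, delta=1e-3, eps=1e-4)

# f has minimum 0 on [0, 1], so some box survives and is returned.
box = find_delta_solution(0.25, 0.0, 1.0, delta=1e-3, eps=1e-3)
```

The returned box is only a candidate: the enclosure over-approximates f, so a box can survive pruning even when f stays above `delta` everywhere inside it.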

While Equation 6 holds if there does not exist a solution to Equation 7,
the converse does not hold, both because of the error inherent in ICP and
because we “relaxed” the right-hand side of Equation 7. This means that $\delta$
controls the precision of the analysis: it determines both the size of the false
solution space and how many times we will sub-divide the input space before
giving up on proving Equation 7 to be unsatisfiable.

Practically, this has two implications for our approach. The first one is that
our approach naturally inherits a degree of looseness in the linear bounds defined
by $\delta$. Specifically, we must shift our plane upward by $\delta$ in addition to the true
$v_l$, so that DREAL can verify the bound. The second is that we have to make
a trade-off between computation and precision. While smaller $\delta$ will allow us
to verify a tighter bound, it generally will also mean a longer verification time.
In our experiments, we find that $\delta = 10^{-7}$ gives tight bounds at an acceptable
runtime, though we may be able to achieve a shorter runtime with a larger $\delta$.


4.3 Computing $v_l$


Now that we have defined how we can verify a candidate bound, we explain our
approach for computing $v_l$. The implementation is outlined in Algorithm 1. Since
failed calls to the verifier can be expensive, at lines 1-2, we first use a relatively
cheap (and unsound) local optimization procedure to estimate the true $v_l$. While
local optimization may get stuck in local minima, neural network activation
functions typically do not have many local minima, so neither will $v(\cdot)$. We use
L-BFGS-B [7], the bounded version of L-BFGS, to perform the optimization. At
a high level, L-BFGS-B takes as input $v(\cdot)$, the input bounds $x_i \in [l_i, u_i]$, and
an initial guess $g \in \mathbb{R}^d$ at the location of the local minimum. It then uses the
Jacobian matrix (i.e., derivatives) of $v(\cdot)$ to iteratively move towards the local
minimum (the Jacobian can be estimated using the finite differences method
or provided explicitly; we use Mathematica [21] to obtain it). We find that
sampling points uniformly in $v(\cdot)$ can usually find a good $g$, and thus L-BFGS-B
often converges in a small number of iterations. L-BFGS-B typically produces an
estimate within $10^{-8}$ of the true value. To account for estimation error we add
an additional $10^{-6}$, plus $2 \times \delta$ to account for the $\delta$-relaxation (line 3). Finally,
we iteratively decrease $v_l$ by a small amount ($10^{-6}$) until DREAL verifies it (lines
4-9).

Going back to our motivating example, we would estimate $v_l$ with a local
minimizer, and then use DREAL to verify the following:

$$\forall x_1 \in [-1, 3.5] :\quad \sigma(x_1) \leq c_1^u x_1 + c_2^u + (-v_l) + 2 \times \delta + 10^{-6}$$

If verification fails, we iteratively decrease the value of $v_l$ by $10^{-6}$, and call DREAL
until the bound is verified. The final value of $c_1^u x_1 + c_2^u + (-v_l) + 2 \times \delta + 10^{-6}$
is the final sound upper bound.
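The estimate-then-decrease loop can be sketched as follows. Two substitutions are made for the sake of a self-contained example and should be read as assumptions: coarse grid sampling stands in for L-BFGS-B, and `verify` is a hypothetical callable standing in for the DREAL query (here an oracle that knows the true minimum of a toy violation function). The names `bound_violation` and `oracle_verify` are invented for this illustration.

```python
def bound_violation(v, lo, hi, verify, delta=1e-7, step=1e-6, samples=1000):
    """Estimate a lower bound on v over [lo, hi], then lower it until verified."""
    # Cheap, unsound estimate (grid sampling replaces L-BFGS-B in this sketch).
    v_l = min(v(lo + (hi - lo) * i / samples) for i in range(samples + 1))
    # Slack for estimation error and for the delta-relaxation.
    v_l -= step + 2 * delta
    # Decrease the candidate by a small amount until the verifier accepts it.
    while not verify(v_l):
        v_l -= step
    return v_l

# Toy violation function with a known minimum of 0.1 at x = 0.3.
TRUE_MIN = 0.1
DELTA = 1e-7

def toy_violation(x):
    return (x - 0.3) ** 2 + 0.1

def oracle_verify(candidate):
    """Stand-in for the verifier: accept when min(v) - candidate >= delta."""
    return TRUE_MIN - candidate >= DELTA

# A deliberately coarse grid makes the first estimate too high, so the
# decrease loop has to run before the candidate is accepted.
v_l = bound_violation(toy_violation, 0.0, 1.0, oracle_verify,
                      delta=DELTA, samples=7)
```

With a better initial estimate the loop exits after very few verifier calls, which is the reason for running a local optimizer first.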


4.4 On the Correctness and Generality of LinSyn 


The full LINSYN procedure is shown in Algorithm 2. The correctness (i.e., soundness)
of the synthesized bounds is guaranteed if the $v_l$ returned by Algorithm 1
is a true lower bound on $v(\cdot)$. Since Algorithm 1 does not return until DREAL
verifies $v_l$ at line 6, the correctness is guaranteed.

Algorithm 1: BoundViolation

Input: Activation $\sigma(x_1, x_2, \ldots)$, Candidate Coefficients $c_1^u, c_2^u, \ldots, c_{d+1}^u$,
       Input Bounds $x_1 \in [l_1, u_1], x_2 \in [l_2, u_2], \ldots$, Jacobian $\nabla v$ (optional)
Output: Lower Bound on Violation $v_l$

 1  $g \leftarrow$ sample points on $v(x_1, x_2, \ldots)$ and take minimum;
 2  $v_l \leftarrow$ L-BFGS-B($v(x_1, x_2, \ldots)$, $x_1 \in [l_1, u_1], x_2 \in [l_2, u_2], \ldots$, $g$, $\nabla v$);
 3  $v_l \leftarrow v_l - 10^{-6} - 2\delta$;
 4  while True do
 5      // Call dReal
 6      if Equation 2 holds then
 7          return $v_l$;
 8      end
 9      $v_l \leftarrow v_l - 10^{-6}$;
10  end


Algorithm 2: SynthesizeUpperBoundCoefficients

Input: Activation $\sigma(x_1, x_2, \ldots)$, Input Bounds $x_1 \in [l_1, u_1], x_2 \in [l_2, u_2], \ldots$,
       Jacobian $\nabla v$ (optional)
Output: Sound Coefficients $c_1^u, c_2^u, \ldots, c_{d+1}^u$

 1  $c_1^u, c_2^u, \ldots, c_{d+1}^u \leftarrow$ Sampling and LP procedure on $\sigma(x)$ over Input Bounds;
 2  $v_l \leftarrow$ BoundViolation($c_1^u, c_2^u, \ldots, c_{d+1}^u$, $x_1 \in [l_1, u_1], x_2 \in [l_2, u_2], \ldots$, $\nabla v$);
 3  $c_{d+1}^u \leftarrow c_{d+1}^u + (-v_l)$;
 4  return $c_1^u, c_2^u, \ldots, c_{d+1}^u$;


Both our procedure in Section 3 and L-BFGS-B require only black-box access
to $\sigma(\cdot)$, so the only potential limit to the arbitrariness of our approach lies in
what elementary operations are supported by DREAL. During our investigation,
we did not find activations that use operations unsupported by DREAL; however,
if an unsupported operation is encountered, one would only need to define an
interval extension [28] for the operation, which can be done for any computable
function.
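To make the notion of an interval extension concrete, here is a minimal sketch for max and min, composed into Hard Tanh. It is an illustration only: intervals are plain `(lower, upper)` tuples, the function names are invented here, and the floating-point endpoints are not rounded outward as a rigorous implementation would require.

```python
def max_interval(x, y):
    """Interval extension of max(x, y): max is monotone in both arguments."""
    (xl, xu), (yl, yu) = x, y
    return (max(xl, yl), max(xu, yu))

def min_interval(x, y):
    """Interval extension of min(x, y), by the same monotonicity argument."""
    (xl, xu), (yl, yu) = x, y
    return (min(xl, yl), min(xu, yu))

def hard_tanh_interval(x):
    """Enclosure of min(1, max(x, -1)) built from the two extensions above."""
    return min_interval((1.0, 1.0), max_interval(x, (-1.0, -1.0)))

# Enclosures of Hard Tanh over two input intervals.
enclosure_a = hard_tanh_interval((-2.0, 0.5))
enclosure_b = hard_tanh_interval((0.25, 3.0))
```

Because both operations are monotone, the enclosures are exact: every value of the function over the input interval lies inside, and both endpoints are attained.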


5 Evaluation 


We have implemented our method in a module called LINSYN, and integrated 
it into the AUTOLIRPA neural network verification framework [47]. A user in- 
stantiates LINSYN with a definition of an activation function, which results in 
an executable software module capable of computing the sound linear lower and 
upper bounds for the activation function over a given input region. LINSYN uses 
Gurobi [17] to solve the LP problem described in Section 3, and DREAL [14] as 
the verifier described in Section 4. In total, LINSYN is implemented in about 1200 lines
of Python code. 


5.1 Benchmarks 


Neural Networks. Our benchmarks are nine deep neural networks trained on the
three different datasets shown below. In the following, a neuron is a node in the
neural network where a linear bound must be computed, and thus the neuron
counts indicate the number of calls to LINSYN that must be made.


- MNIST: MNIST is a dataset of hand-written integers labeled with the
  corresponding integer in the image. The images have 28x28 pixels, with each
  pixel taking a gray-scale value between 0 and 255. We trained three variants of
  a 4-layer CNN (convolutional neural network). Each takes as input a 28x28
  = 784-dimensional input vector and outputs 10 scores, one for each class.
  In total, each network has 2,608 neurons: 1568, 784, and 256 in the first,
  second, and third layers, respectively.

- CIFAR: CIFAR is a dataset of RGB images from 10 different classes. The
  images have 32x32 pixels, with each pixel having an R, G, and B value in
  the range 0 to 255. We trained three variants of a 5-layer CNN. Each takes
  a 32x32x3 = 3072-dimensional input vector and outputs 10 scores, one for
  each class. In total, each network has 5376 neurons: 2048, 2048, 1024, and
  256 neurons in the first, second, third, and fourth layers, respectively.

- SST-2: The Stanford Sentiment Treebank (SST) dataset consists of sentences
  taken from movie reviews that are human annotated with either positive
  or negative, indicating the sentiment expressed in the sentence. We
  trained three different variants of the standard LSTM architecture. These
  networks take as input a sequence of 64-dimensional word embeddings and
  output 2 scores, one for positive and one for negative. Each network has a hidden
  size of 64, which works out to 384 neurons per input in the input sequence.


Activation Functions. We experimented with the four activation functions
shown in Fig. 6. GELU and Swish were recently proposed alternatives to the
standard ReLU function due to their desirable theoretical properties [18] such
as reduced overfitting [38], and they have seen use in OpenAI’s GPT [31] and
very deep feed forward networks [32]. Similarly, Hard-Tanh is an optimized
version of the common tanh function, while the Log-Log function [16] is a
sigmoid-like function used in forecasting.


Swish:      $x \times \sigma(x)$
GELU:       $0.5x\,(1 + \tanh(\sqrt{2/\pi}\,(x + 0.044715x^3)))$
Hard Tanh:  $\min(1, \max(x, -1))$
Log-Log:    $1 - e^{-e^{x}}$


Fig. 6. Nonlinear activation functions.


Table 1. Comparing certified accuracy and run time of LINSYN and AUTOLIRPA.

| Dataset | Network Architecture     | AUTOLIRPA [47] % certified | AUTOLIRPA [47] time (s) | Our Method (new) % certified | Our Method (new) time (s) |
|---------|--------------------------|----------------------------|-------------------------|------------------------------|---------------------------|
| MNIST   | 4-Layer CNN with Swish   | 0.34                       | 15                      | 0.76                         | 796                       |
| MNIST   | 4-Layer CNN with Gelu    | 0.01                       | 359                     | 0.72                         | 814                       |
| MNIST   | 4-Layer CNN with Log Log | 0.00                       | 38                      | 0.24                         | 867                       |
| CIFAR   | 5-Layer CNN with Swish   | 0.03                       | 69                      | 0.35                         | 1,077                     |
| CIFAR   | 5-Layer CNN with Gelu    | 0.00                       | 1,217                   | 0.31                         | 1,163                     |
| CIFAR   | 5-Layer CNN with Log Log | 0.59                       | 98                      | 0.69                         | 717                       |
| SST-2   | LSTM with sig tanh       | 0.93                       | 37                      | 0.91                         | 1,074                     |
| SST-2   | LSTM with hard tanh      | -                          | -                       | 0.64                         | 2300                      |
| SST-2   | LSTM with log log        | 0.16                       | 1,072                   | 0.82                         | 2,859                     |


Table 2. Comparing certified accuracy and run time of LINSYN and POPQORN.

| Dataset | Network Architecture | POPQORN [25] % certified | POPQORN [25] time (s) | Our Method (new) % certified | Our Method (new) time (s) |
|---------|----------------------|--------------------------|-----------------------|------------------------------|---------------------------|
| SST-2   | LSTM with sig tanh   | 0.93                     | 1517                  | 0.90                         | 1,074                     |


The Verification Problem. The verification problem we consider is to certify that
an input is robust to bounded perturbations of magnitude $\epsilon$, where $\epsilon$ is a small
number. Certifying means proving that the classification result of the neural
network does not change in the presence of perturbations. We focus on $\ell_\infty$
robustness, where we take an input $x \in \mathbb{R}^n$ and allow a bounded perturbation of
$\pm\epsilon$ to each element in $x$. For each network, we take 100 random test inputs,
filter out those that are incorrectly classified, apply an $\epsilon$-bounded perturbation
to the correctly classified inputs, and then attempt to prove the classification
remains correct. We choose $\epsilon$ values common in prior work. For MNIST networks,
in particular, we choose $\epsilon = 8/255$. For CIFAR networks, we choose $\epsilon = 1/255$.
For SST-2 networks, we choose $\epsilon = 0.04$, and we only apply it to the first word
embedding in the input sequence.
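The $\ell_\infty$ setup can be sketched as follows. Two points are assumptions rather than statements from the text: inputs are taken to be scaled to [0, 1] (which is what an $\epsilon$ of 8/255 suggests), and the certification test used is the common criterion that the lower bound of the true class score must exceed the upper bound of every other class score. The function names and the output bounds are hypothetical.

```python
def linf_box(x, eps):
    """Per-element input bounds [x_i - eps, x_i + eps] for an l-infinity ball."""
    return [(xi - eps, xi + eps) for xi in x]

def is_certified(output_bounds, label):
    """True if the true class provably outscores all others.

    output_bounds holds one (lower, upper) pair per class score, as
    produced by some sound bounding procedure over the input box.
    """
    lower_true = output_bounds[label][0]
    return all(lower_true > upper
               for k, (_, upper) in enumerate(output_bounds)
               if k != label)

# Input box for a hypothetical two-pixel input with the MNIST epsilon.
box = linf_box([0.5, 0.25], 8 / 255)

# Hypothetical output bounds for a three-class network.
bounds = [(0.2, 0.4), (0.5, 0.9), (-1.0, 0.45)]
robust_for_class_1 = is_certified(bounds, 1)
robust_for_class_0 = is_certified(bounds, 0)
```

Tighter linear bounds shrink the output intervals, which is how they translate into a higher fraction of certified inputs.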


5.2 Experimental Results 


Our experiments were designed to answer the following two questions: (1) How
do LINSYN’s linear bounds compare with handcrafted bounds? (2) How does
the runtime of LINSYN compare to state-of-the-art linear bounding techniques?
To answer these questions, we compare the effectiveness of LINSYN’s linear
bounds with the state-of-the-art linear bounding technique implemented in
AUTOLIRPA. To the best of our knowledge, this is the only tool that can handle the
activation functions we use in our benchmarks. As another comparison point,
we also compare with POPQORN, a state-of-the-art linear bounding technique
for LSTM networks. POPQORN tackles the challenge of computing tight linear
bounds for $\mathrm{sigmoid}(x) \times \tanh(y)$ and $x \times \mathrm{sigmoid}(y)$ using an expensive gradient
descent based approach, and thus makes a good comparison point for runtime
and accuracy. Our experiments were conducted on a computer with an Intel 2.6
GHz i7-6700 8-core CPU and 32GB RAM. Both AUTOLIRPA and LINSYN are
engineered to bound individual neurons in parallel. We configure each method
to use up to 6 threads.


Overall Comparison First, we compare the overall performance of our new 
method and the default linear bounding technique in AUTOLIRPA. The re- 
sults are shown in Table 1. Here, Columns 1-2 show the name of the dataset and 
the type of neural networks. Columns 3-4 show the results of the default Au- 
TOLIRPA, including the percentage of inputs certified and the analysis time in 
seconds. Similarly, Columns 5-6 show the results of our new method. 


The results in Table 1 show that, in terms of the analysis time, our method is 
slower, primarily due to the use of constraint solvers (namely DREAL and the LP 
solver) but overall, the analysis speed is comparable to AUTOLIRPA. However, 
in terms of accuracy, our method significantly outperforms AUTOLIRPA. In 
almost all cases, our method was able to certify a much higher percentage of the 
inputs. For example, LINSYN more than quadruples the certified robustness of 
the LSTM with log log benchmark, and handles very well the relatively complex 
GeLU function. As for SST-2: LSTM with hard tanh, AUTOLIRPA does not
support the general $\max(x, y)$ operation, so a comparison is not possible without
significant engineering work.

The only exception to the improvement is SST-2: LSTM with sig tanh, for
which the results are similar (0.93 versus 0.91). In this case, there is likely little
to be gained over the default, decomposition-based approach of AUTOLIRPA in
terms of tightness because the inputs to $\mathrm{sigmoid}(x) \times \tanh(y)$ and $x \times \mathrm{sigmoid}(y)$
are not related, i.e., $x$ and $y$ are two separate variables. This is in contrast to,
e.g., $\mathrm{swish}(x) = x \times \mathrm{sigmoid}(x)$, where the left-hand side and right-hand side
of the multiplication are related.

In Table 2, we show a comparison between LINSYN and POPQORN. The 
result shows that our approach achieves similar certified robustness and runtime, 
even though POPQORN was designed to specifically target this particular type 
of LSTM architecture, while LINSYN is entirely generic. 


Detailed Comparison. Next, we perform a more in-depth comparison of accuracy
by comparing the widths of the final output neurons’ intervals that are computed
by AUTOLIRPA and LINSYN. The results are shown in the scatter plot in Fig. 7
and the histogram in Fig. 8. Each point in the scatter plot represents a single
output neuron $z_i$ for a single verification problem. The x-axis is the width of
the interval of the output neuron $z_i$ (i.e., $u_i - l_i$) computed by LINSYN, and
the y-axis is the width computed by AUTOLIRPA. A point above the diagonal
line indicates that LINSYN computed a tighter (smaller) final output interval.
In the histogram, we further illustrate the accuracy gain as the width ratio,
measured as the AUTOLIRPA width divided by the LINSYN width. Overall, the
results show that LINSYN is more accurate in nearly all cases, and LINSYN
often produces final output bounds 2-5X tighter than AUTOLIRPA.
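As a small worked example of the width ratio, the sketch below assumes the ratio is the baseline (AUTOLIRPA) interval width divided by the LINSYN interval width, so that values above 1 mean LINSYN is tighter. The two intervals are hypothetical and are not taken from the experiments.

```python
def width(interval):
    """Width u - l of an output interval (l, u)."""
    lower, upper = interval
    return upper - lower

def width_ratio(baseline_interval, linsyn_interval):
    """Baseline width divided by LINSYN width; above 1 means LINSYN is tighter."""
    return width(baseline_interval) / width(linsyn_interval)

# Hypothetical bounds on the same output neuron from the two tools.
ratio = width_ratio((-2.0, 2.0), (-0.5, 0.5))
```

In the scatter plot, this neuron would sit above the diagonal, at x = 1.0 and y = 4.0.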


6 Related Work 


Linear Bound-based Neural Network Verification There is a large body of work 
on using linear-bounding techniques [36, 48, 34, 6, 45, 29, 30, 46, 27] and 
other abstract domains such as concrete intervals, symbolic intervals [44], and 
Zonotopes [15], for the purpose of neural network verification. All of these can 
be thought of as leveraging restricted versions of the polyhedral abstract do- 
main [10, 9]. To the best of our knowledge, these approaches are the most scal- 
able (in terms of network size) due to the use of approximations, but this also 
means they are less accurate than exact approaches. In addition, all these ap- 
proaches have the limitation that they depend on bounds that are hand-crafted 
by an expert. 

SMT solver-based Neural Network Verification There is also a large body 
of work on using exact constraint solving for neural network verification. Early 
works include solvers specifically designed for neural networks, such as Reluplex 
and Marabou [23, 24] and others [11], and leveraging existing solvers [12, 20, 
5, 20, 4, 40, 19]. While more accurate, the reliance on an SMT solver typically 
limits their scalability. More recent work often uses solvers to refine the bounds 
computed by linear bounding [35, 37, 43, 42, 41]. Since the solvers leveraged in 
these approaches usually involve linear constraint solving techniques, they are 
usually only applicable to piece-wise linear activation functions such as ReLU 
and Max/Min-pooling. 


7 Conclusions 


We have presented LINSYN, a method for synthesizing linear bounds for arbi- 
trary activation functions. The key advantage of LINSYN is that it can handle 
complex activation functions, such as Swish, GELU, and Log Log as a whole, 
allowing it to synthesize much tighter linear bounds than existing tools. Our 
experimental results show this increased tightness leads to drastically increased 
certified robustness, and tighter final output bounds. 
Abstract. Theories and tools based on multiparty session types offer
correctness guarantees for concurrent programs that communicate using 
message-passing. These guarantees usually come at the cost of an intrin- 
sically top-down approach, which requires the communication behaviour 
of the entire program to be specified as a global type. 

This paper introduces kmclib: an OCaml library that supports the de- 
velopment of correct message-passing programs without having to write 
any types. The library utilises the meta-programming facilities of OCaml 
to automatically infer the session types of concurrent programs and verify
their compatibility (k-MC [15]). Well-typed programs, written with
kmclib, do not lead to communication errors and cannot get stuck. 


Keywords: Multiparty Session Types - Concurrent Programming - OCaml 


1 Introduction 


Multiparty session types (MPST) [5] are a popular type-driven technique to
ensure the correctness of concurrent programs that communicate using
message-passing. The key benefit of MPST is to guarantee statically that the components
of a program have compatible behaviours, and thus no components can get
permanently stuck. Many implementations of MPST in different programming
languages have been proposed in the last decade [2,4,6,10,12,16-18,20,23]; however,
all suffer from a notable shortcoming: they require programmers to adopt a
top-down approach that does not fit well in modern development practices. When
changes are frequent and continual (e.g., continuous delivery), re-designing the
program and its specification at every change is not feasible.

Most MPST theories and tools advocate an intrinsically top-down approach. 
They require programmers to specify the communication (often in the form of 
a global type) of their programs before they can be type-checked. In practice, 
type-checking programs against session types is very difficult. To circumvent the 
problem, most implementations of MPST rely on external tooling that generates
code from a global type, see e.g., all works based on the Scribble toolchain [22]. 

In this paper, we present an OCaml library, called kmclib [8,9], which
supports the development of programs that enjoy all the benefits of MPST while
avoiding their main drawbacks. The kmclib library guarantees that threads in
well-typed programs will not get stuck. The library also enables bottom-up
development: programmers write message-passing programs in a natural way, without
having to write session types. Our library is built on top of Multicore OCaml [21],
which offers highly scalable and efficient concurrent programming but does not
provide any static guarantees with respect to concurrency.


Fig. 1: Workflow of the kmclib library (the PPX plugin is the shaded box).

Figure 1 gives an overview of kmclib. Its implementation combines the power 
of the type-aware macro system of OCaml (Typed PPX) with two recent ad- 
vances in the session types area: an encoding of MPST in OCaml (channel vector 
types [10]) and a session type compatibility checker (k-MC checker [15]). To our 
knowledge, this is the first implementation of type inference for MPST and the 
first integration of compatibility checking in a programming language. 

The kmclib library [8,9] offers several advantages compared to earlier MPST
implementations. (1) It is flexible: programmers can implement communica- 
tion patterns (e.g., fire-and-forget patterns [15]) that are not expressible in the 
synchrony-oriented syntax of global types. (2) It is lightweight as it piggybacks 
on OCaml’s type system to check and infer session types, hence lifting the burden 
of writing session types off the programmers. (3) It is user-friendly thanks to its 
integration in Visual Studio Code, where compatibility violations are mapped 
to precise locations in the code. (4) It is well-integrated into the natural edit- 
compile-run cycle. Although compatibility is checked by an external tool, this 
step is embedded as a compilation step and thus hidden from the user. 


2 Safe Concurrent Programming in Multicore OCaml 


We give an overview of the features and usage of kmclib using the program 
in Figure 2 (top) which calculates Fibonacci numbers. The program consists of 
three concurrent threads (user, master, and worker) that interact using point- 
to-point message-passing. Initially, the user thread sends a request to the master 
to start the calculation, then waits for the master to return a work-in-progress 
message, or the final result. After receiving the result, the user sends back a stop 
message. Upon receiving a new request, the master splits the initial computation 
in two and sends two tasks to a worker. For each task that the worker receives, it 
replies with a result. The master and worker threads are recursive and terminate 
only upon receiving a stop message. 
 1  let KMC (uch,mch,wch) = [%kmc.gen (u,m,w)]
 2
 3  let user () =
 4    let uch = send uch#m#compute 42 in
 5    let rec loop uch : unit =
 6      match receive uch#m with
 7      | `wip(res, uch) ->
 8        printf "in progress: %d\n" res;
 9        loop uch
10      | `result(res, uch) ->
11        printf "result: %d\n" res;
12        send uch#m#stop ()
13    in loop uch
14
15  let worker () =
16    let rec loop wch : unit =
17      match receive wch#m with
18      | `task(num, wch) ->
19        loop (send wch#m#result (fib num))
20      | `stop((), wch) -> wch
21    in loop wch
22  let master () =
23    let rec loop (mch : [%kmc.check u]) : unit =
24      match receive mch#u with
25      | `compute(x, mch) ->
26        let mch = send mch#w#task (x - 2) in
27        let mch = send mch#w#task (x - 1) in
28        let `result(r1, mch) = receive mch#w in
29        let mch = send mch#u#wip r1 in
30        let `result(r2, mch) = receive mch#w in
31        loop (send mch#u#result (r1 + r2))
32      | `stop((), mch) ->
33        send mch#w#stop ()
34    in loop mch
35
36  let () =
37    let ut = Thread.create user () in
38    let mt = Thread.create master () in
39    let wt = Thread.create worker () in
40    List.iter Thread.join [ut;mt;wt]


[CFSM diagrams for the roles u (user), m (master), and w (worker)]


Fig. 2: Example of kmclib program (top) and inferred session types (bottom). 


Figure 2 (bottom) gives a session type for each thread, i.e., the behaviour
of each thread with respect to communication. For clarity we represent session types as
communicating finite state machines (CFSMs [1]), where ! (resp. ?) denotes
sending (resp. receiving). For example, um!compute means that the user is sending
to the master a message compute, while um?compute says that the master receives
compute from the user. Our library infers these CFSM representations from the
OCaml code in Figure 2 (top), and verifies statically that the three threads
are compatible, hence no thread can get stuck due to communication errors. If
compatibility cannot be guaranteed, the compiler reports the kind of violations
(i.e., progress or eventual reception error) and their locations in the code. Figure 3
shows how such semantic errors are reported visually in Visual Studio Code.

Albeit simple, the common communication pattern used in Figure 2 can- 
not be expressed as a global type, and thus cannot be implemented in previous 
MPST implementations. Concretely, global types cannot express the intrinsi- 
cally asynchronous interactions between the master and worker threads (i.e., the 
master may send a second task message, while the worker sends a result). 


Programming with kmclib. To enable safe message-passing programs, kmclib
provides two communication primitives, send and receive, and two primitives
for channel creation (KMC and %kmc.gen). Our library supports all the features of
traditional MPST implementations and has similar limitations (fixed number
of participants, no delegation, etc.). We only give a user-oriented description of
these primitives here (see [7, §A] for an overview of their implementations).
Left (test.ml, reported by ocamllsp):

    29  let mch = send mch#u#wip r1 in
    30  let `result(r2, mch) = receive mch#w in

    This expression has type [ `progress_violation ]
    It has no method w

Right (test.ml, reported by ocamllsp):

    30  (* let `result(r2, mch) = receive mch#w in *)
    31  loop (send mch#u#result r1)

    This expression has type [ `eventual_reception_violation ]
    It has no method u


Fig. 3: Examples of type errors: progress (left) and eventual reception (right).


The crux of kmclib is the session channel creation: [%kmc.gen (u,m,w)] at 
Line 1. This primitive takes a tuple of role names as argument (i.e., (u,m,w)) and 
returns a tuple of communication channels, which are bound to (uch,mch,wch). 
These channels will be used by the threads implementing roles user (Lines 3- 
13), worker (Lines 15-21), and master (Lines 22-34). Channels are implemented 
using concurrent queues from Multicore OCaml (Domainslib.Chan.t) but other 
underlying transports can easily be provided. 

Threads send and receive messages over these channels using the commu- 
nication primitives provided by kmclib. The send primitive requires three ar- 
guments: a channel, a destination role, and a message. For instance, the user 
sends a request to the master with send uch#m#compute 20 where uch is the user’s 
communication channel, m indicates the destination, and compute 20 is the mes- 
sage (consisting of a label and a payload). Observe that a sending operation 
returns a new channel which is to be used in the continuation of the interac- 
tions, e.g., uch bound at Line 4, which must be used linearly (see [7] for details). 
Receiving messages works in a similar way to sending messages, e.g., see Line 6 
where the user waits for a message from the master with receive uch#m. We use 
OCaml’s pattern matching to match messages against their labels and bind the 
payload and continuation channel. See, e.g., Lines 7-10 where the user expects 
either a wip or result message. The receive primitive returns the payload res 
and a new communication channel uch. 

New thread instances are spawned in the usual way; see Lines 36-39. The 
code at Line 40 waits for them to terminate. 


Compatibility and error reporting. While the code in Figure 2 may ap- 
pear unremarkable, it hides a substantial machinery that guarantees that, if a 
program type-checks, then its constituent threads are safe, i.e., no thread gets 
permanently stuck and all messages that are sent are eventually received. This 
property is ensured by kmclib using OCaml’s type inference and PPX plugins 
to infer a session type from each thread then check whether these session types 
are k-multiparty compatible (k-MC) [15]. 

If a system of session types is k-MC, then it is safe [15, Theorem 1], i.e., it 
has the progress property (no role gets permanently stuck in a receiving state) 
and the eventual reception property (all sent messages are eventually received). 
Checking k-MC notably involves checking that all their executions (where each 
channel contains at most k messages) satisfy progress and eventual reception. 
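To give a feel for what a bounded check over executions involves, the sketch below explores every configuration of a small system of communicating finite state machines with at most k messages per channel, and reports configurations where nothing can move although a role still has work to do or a message is left unread. It is a simplified illustration under stated assumptions, not the k-MC algorithm of [15] (which checks further conditions), and every name in it is invented for this example.

```python
from collections import deque

def freeze(queues):
    """Hashable view of the channel contents, omitting empty channels."""
    return tuple(sorted((ch, msgs) for ch, msgs in queues.items() if msgs))

def successors(machines, roles, states, queues, k):
    """Yield every configuration reachable in one send or receive step."""
    for i, role in enumerate(roles):
        for action, peer, label, target in machines[role].get(states[i], []):
            q = dict(queues)
            if action == "!":                    # send: append to role -> peer
                ch = (role, peer)
                if len(q.get(ch, ())) >= k:
                    continue                     # channel bound reached
                q[ch] = q.get(ch, ()) + (label,)
            else:                                # receive: head of peer -> role
                ch = (peer, role)
                if not q.get(ch) or q[ch][0] != label:
                    continue
                q[ch] = q[ch][1:]
            yield states[:i] + (target,) + states[i + 1:], freeze(q)

def check(machines, initial, k):
    """Breadth-first exploration; returns a list of (kind, detail) problems."""
    roles = sorted(machines)
    start = (tuple(initial[r] for r in roles), ())
    seen, todo, problems = {start}, deque([start]), []
    while todo:
        states, frozen = todo.popleft()
        nxt = list(successors(machines, roles, states, dict(frozen), k))
        if not nxt:
            stuck = [r for r, s in zip(roles, states) if machines[r].get(s)]
            if stuck:
                problems.append(("progress", stuck))
            elif frozen:
                problems.append(("eventual reception", frozen))
        for cfg in nxt:
            if cfg not in seen:
                seen.add(cfg)
                todo.append(cfg)
    return problems

# u sends compute to m and waits for result; m replies in the good system.
good = {
    "u": {0: [("!", "m", "compute", 1)], 1: [("?", "m", "result", 2)]},
    "m": {0: [("?", "u", "compute", 1)], 1: [("!", "u", "result", 2)]},
}
# In the bad system m never replies, so u waits forever.
bad = {"u": good["u"], "m": {0: [("?", "u", "compute", 1)]}}
init = {"u": 0, "m": 0}

good_problems = check(good, init, k=1)
bad_problems = check(bad, init, k=1)
```

A sender blocked only by the bound k would also be flagged by this toy check, which is one of the ways it differs from the real procedure.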

The k-MC-checker [15] performs a bounded verification to discover the least
k for which a system is k-MC, up to a specified upper bound N. In the kmclib
API, this bound can be optionally specified with [%kmclib.gen roles ~bound: N].
The k-MC-checker emits an error if the bound is insufficient to guarantee safety.


(1) Read the AST:
      let (uch,mch,wch) = [%kmc.gen fib (u,m,w)]
(2) Call the typechecker & extract the type (type inference propagates from
    uses such as send uch#m#compute ..., match receive mch#u with `compute ..,
    and match receive wch#m with `task ...):
      <m: <compute: ...> > * <u: [ `compute of ..] > * <m: [ `task of ..] >
(3) Translate & invoke verifier:
      u: m!compute<int>;...
      m: u?compute<int>;...
      w: m?task<int>;...
(4,5) Instrumentation

Fig. 4: Inference of session types from OCaml code.

The [%kmc.gen (u,m,w)] primitive also feeds the results of k-MC checking
back to the code. If the inferred session types are k-MC, then channels for roles
u, m, and w can be generated; otherwise a type error is raised. We have modified the
k-MC-checker to return counterexample traces when the verification fails. This
helps give actionable feedback to the programmer, as counterexample traces are
translated to OCaml types and inserted at the hole corresponding to [%kmc.gen].
This has the effect of reporting the precise location of the errors.

To report errors in a function parameter, we provide an optional macro for 
types: [%kmc.check rolename] (see faded code in Line 23). Figure 3 shows ex- 
amples of such error reports. The left-hand-side shows the reported error when 
Line 26 is commented out, i.e., the master sends one task, but expects two result 
messages; hence progress is violated since the master gets stuck at Line 30. The 
right-hand-side shows the reported error when Line 30 is commented out. In this 
case, variable mch in Line 31 (master) is highlighted because the master fails to 
consume a message from channel mch. 


3 Inference of Session Types in kmclib 


The kmclib API. The kmclib primitives allow the vanilla OCaml typechecker 
to infer the session structure of a program, while simultaneously providing a user- 
friendly communication API for the programmer. To enable inference of session 
types from concurrent programs, we leverage OCaml’s structural typing and row 
polymorphism. In particular, we reuse the encoding from [10] where input and 
output session types are encoded as polymorphic variants and objects in OCaml. 
In contrast to [10] which relies on programmers writing global types prior to type- 
checking, kmclib infers and verifies local session types automatically, without 
requiring any additional type or annotation. 


Typed PPX Rewriter. To extract and verify session types from a piece of 
OCaml code, the kmclib library makes use of OCaml PreProcessor eXtensions 
(PPX) plugins which provide a powerful meta-programming facility. PPX plu- 
gins are invoked during the compilation process to manipulate or translate the 
abstract syntax tree (AST) of the program. This is often used to insert additional
definitions, e.g., pretty-printers, at compile-time. 

A key novelty of kmclib is the combination of PPX with a form of type- 
aware translation, whereas most PPX plugins typically perform purely syntactic 
(type-unaware) translations. Figure 4 shows the workflow of the PPX rewriter, 
overlaid on code snippets from Figure 2. The inference works as follows.


(1) The plugin reads the AST of the program code to replace the [%kmc.gen] 
primitive with a hole, which can have any type. 

(2) The plugin invokes the typechecker to get the typed AST of the program. In 
this way, the type of the hole is inferred to be a tuple of channel object types 
whose structure is derived from their usages (i.e., mch#u#compute). 

To enable this propagation, we introduce the idiom “let (KMC ...) = ...” which 
enforces the type of the hole to be monomorphic. Otherwise, the type would be 
too general and this would spoil the type propagation, see [7, § B]. 

(3) The inferred type is translated to a system of (local) session types, which 
are passed to the k-MC-checker. 

(4) If the system is k-MC, then it is safe and the plugin instruments the code 
to allocate a fresh channel tuple (i.e., concurrent queues) at the hole. 

(5) Otherwise, the k-MC-checker returns a violation trace which is translated 
back to an OCaml type and inserted at the hole, to report a more precise error. 


The translation is confined to the [%kmc.gen] expression, retaining a clear 
correspondence between the original and translated code. It can be understood 
as a form of ad hoc polymorphism reminiscent of type classes in Haskell. Just as 
the Haskell typechecker verifies whether a type belongs to a class, kmclib 
verifies whether the set of session types belongs to the class of k-MC systems. 
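
To give an intuition for steps (3) to (5), the following is a minimal sketch of a bounded safety check over a system of communicating automata. It is written in Python for brevity and is not the k-MC algorithm used by kmclib: it only explores configurations whose FIFO queues hold at most k messages and reports a configuration in which no role can move. The protocol, the data layout, and the name find_stuck are assumptions made for illustration.

```python
from collections import deque

# Hypothetical two-role protocol (not taken from kmclib): role "u" sends
# "compute" to role "m" and then waits for "result".
# Each role maps a local state to its outgoing (action, next_state) pairs;
# an action is (kind, peer, label) with kind "!" (send) or "?" (receive).
SYSTEM = {
    "u": {0: [(("!", "m", "compute"), 1)], 1: [(("?", "m", "result"), 2)], 2: []},
    "m": {0: [(("?", "u", "compute"), 1)], 1: [(("!", "u", "result"), 2)], 2: []},
}


def find_stuck(system, k):
    """Breadth-first search over configurations with at most k queued messages
    per channel. Returns a stuck configuration, or None if none is reachable."""
    roles = sorted(system)
    start = (tuple(0 for _ in roles), ())
    seen, todo = {start}, deque([start])
    while todo:
        states, queue_items = todo.popleft()
        queues = dict(queue_items)
        successors, bound_hit = [], False
        for i, role in enumerate(roles):
            for (kind, peer, label), nxt in system[role][states[i]]:
                updated = dict(queues)
                if kind == "!":
                    chan = (role, peer)
                    if len(queues.get(chan, ())) >= k:
                        bound_hit = True  # send exists but exceeds the bound
                        continue
                    updated[chan] = queues.get(chan, ()) + (label,)
                else:
                    chan = (peer, role)
                    pending = queues.get(chan, ())
                    if not pending or pending[0] != label:
                        continue  # FIFO: only the head message can be received
                    updated[chan] = pending[1:]
                new_states = states[:i] + (nxt,) + states[i + 1:]
                items = tuple(sorted((c, q) for c, q in updated.items() if q))
                successors.append((new_states, items))
        done = not queue_items and all(
            not system[r][s] for r, s in zip(roles, states))
        if not successors and not bound_hit and not done:
            return states, queues
        for conf in successors:
            if conf not in seen:
                seen.add(conf)
                todo.append(conf)
    return None
```

In this sketch a returned configuration plays the role of the violation trace of step (5), while a result of None corresponds to the success case of step (4). The actual checker additionally establishes that the bound k is sufficient, which this sketch does not attempt.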


4 Conclusion 


We have developed a practical library for safe message-passing programming. 
The library enables developers to program and verify arbitrary communication 
patterns without the need for type annotations or user-operated external tools. 
Our automated verification approach can be applied to other general-purpose 
programming languages. Indeed it mainly relies on two ingredients: static struc- 
tural typing and metaprogramming facilities. Both are available, with a varying 
degree of support, in, e.g., Scala, Haskell, TypeScript, and F#. 

Our work is reminiscent of automated software model checking, which has 
a long history (see [11] for a survey). There are few works on inference and 
verification of behavioural types, i.e., [3, 13, 14, 19]. However, Perera et al. [19] 
only present a prototype research language, while Lange et al. [3, 13, 14] propose 
verification procedures for Go programs that rely on external tools which are 
integrated with neither the language nor its type system. To our knowledge, ours 
is the first implementation of type inference for MPST and the first integration of 
session types compatibility checking within a programming language. 

Acknowledgements. This work is partially supported by KAKENHI 17K12662, 
21K11827, 21H03415, and EPSRC EP/W007762/1. 

Abstract. Runtime verification (RV) enables monitoring systems at 
runtime, to detect property violations early and limit their potential 
consequences. This paper presents an end-to-end framework to capture 
requirements in structured natural language and generate monitors that 
capture their semantics faithfully. We leverage NASA’s Formal Requirement 
Elicitation Tool (FRET), and the RV system COPILOT. We extend 
FRET with mechanisms to capture additional information needed to gener- 
ate monitors, and introduce OGMA, a new tool to bridge the gap between 
FRET and COPILOT. With this framework, users can write requirements 
in an intuitive format and obtain real-time C monitors suitable for use in 
embedded systems. Our toolchain is available as open source. 


1 Introduction 


Safety-critical systems, such as aircraft, automobiles, and power systems, where 
failure can result in injury or death of a human [23], must undergo extensive 
assurance. The verification process must ensure that the system satisfies its 
requirements under realistic operating conditions and that there is no unintended 
behavior. Verification rests on possessing a precise statement of requirements, 
arguably one of the most difficult tasks in engineering reliable software. 

Runtime verification (RV) has the potential to enable the safe 
operation of complex safety-critical systems. RV monitors can be used to detect 
and respond to property violations during missions, as well as to verify implemen- 
tations and simulations at design time. For monitors to be effective, they must 
faithfully reflect the mission requirements, which is difficult for non-trivial sys- 
tems because correctness properties must be expressed in a precise mathematical 
formalism while requirements are generally written in natural language. 

The focus of this paper is to provide an end-to-end framework that takes 
as input requirements and other necessary data and provides mechanisms to 
1) help the user deeply understand the semantics of these requirements, 
2) automatically generate formalizations and 3) produce RV monitors that 
faithfully capture the semantics of the requirements. We leverage NASA’s Formal 
Requirement Elicitation Tool (FRET) and the runtime monitoring system 
COPILOT [35]. FRET allows users to express and understand requirements 
through its intuitive structured natural language (named FRETISH) and elicitation 
mechanisms, and generates formalizations in temporal logic. COPILOT allows 
users to specify monitors and compile them to hard real-time C code. 

Fig. 1: Step-by-step workflow 

The contribution of this paper is the tight integration of the FRET-COPILOT 
tools to support the automated synthesis of executable RV monitors directly 
from requirement specifications. In particular, we present: 

— A new tool, named OGMA, that receives requirement formalizations and 
variable data from FRET and compiles these into COPILOT monitors. 
— An extension of the FRET analysis portal to support the generation and 
export of specifications that can be directly digested by OGMA. 
— Preliminary experimental results that evaluate the proposed workflow. 
All tools needed by our workflow are available as open source [4]. 


Related Work. A number of runtime verification languages and systems have 
been applied in resource-constrained environments [6] [28]. In 
contrast to our work, these systems do not provide a direct translation from 
natural language. Several tools formalize natural-language-like 
requirements, but not for the purpose of generating runtime monitors. The 
STIMULUS tool allows users to express requirements in an extensible, 
natural-like language that is syntactic sugar for hierarchical state machines. 
The machines then act as monitors that can be used to validate requirements 
during the design and testing phases, but are not intended to be used at runtime. 
FLEA is a formal language for expressing requirements that compiles to 
runtime monitors in a garbage collected language, making it harder to use in 
embedded systems; in contrast, our approach generates hard real-time code. 


2 Step-by-step Framework Workflow 


To integrate FRET and COPILOT, we extended the FRET analysis portal and 
created the OGMA tool. Figure 1 shows the step-by-step workflow of the complete 
framework; dashed lines represent the newly added steps (2, 3, and 4). Once 
requirements are written in FRETISH, FRET helps users understand and refine 
their requirements through various explanations and simulation (step 0). Next, 
FRET automatically translates requirements (step 1) into pure Past-time Metric 
Linear Temporal Logic (pmLTL) formulas. Next, information about the variables 
referenced in the requirements must be provided by the user (step 2). The 
formulas, as well as the provided variables’ data, are then combined to generate 
the Component Specification (step 3). Based on this specification, OGMA creates 
a complete COPILOT monitor specification (step 4). COPILOT then generates 
the C Monitor (step 5), which is given along with other C code (step 6) to a C 
Compiler for the generation (step 7) of the final object code. 


NL: “While flying, if the airspeed is below 100 m/s, the autopilot shall increase 
the airspeed to at least 100 m/s within 10 seconds.” 


FRETish: in flight mode if airspeed < 100 the aircraft shall within 
10 seconds satisfy (airspeed >= 100) 


pmLTL: (H (Lin_flight -> (Y (((O[=10] (((airspeed < 100) & ((Y (!(airspeed < 100))) | 
Fin_flight)) & (!(airspeed >= 100)))) -> (O[<10] (Fin_flight | (airspeed >= 100)))) S 
(((O[=10] (((airspeed < 100) & ((Y (!(airspeed < 100))) | Fin_flight)) & (!(airspeed >= 
100)))) -> (O[<10] (Fin_flight | (airspeed >= 100)))) & Fin_flight))))) & (((!Lin_flight) 
S ((!Lin_flight) & Fin_flight)) -> (((O[=10] (((airspeed < 100) & ((Y (!(airspeed < 
100))) | Fin_flight)) & (!(airspeed >= 100)))) -> (O[<10] (Fin_flight | (airspeed >= 
100)))) S (((O[=10] (((airspeed < 100) & ((Y (!(airspeed < 100))) | Fin_flight)) & 
(!(airspeed >= 100)))) -> (O[<10] (Fin_flight | (airspeed >= 100)))) & Fin_flight))), 


where Fin_flight (First timepoint in flight mode) is flight & (FTP | Y !flight), Lin_flight 
(Last timepoint in flight mode) is !flight & Y flight, FTP (First Time Point) is ! Y true. 


Fig. 2: Running example in Natural Language (NL), FRETISH, and pmLTL forms. 


Running Example. The next sections illustrate each workflow step using a 
flight-critical system requirement: airplanes should always avoid stalling (a stall 
is a sudden loss of lift, which may lead to a loss of control). To avoid stalls, they 
should fly above a certain speed, known as stall speed (as well as stay below a 
critical angle of attack). Our running requirement example is captured in natural 
language in Figure 2. For the purposes of this example, we consider the airspeed 
threshold to be 100 m/s and the correction time to be 10 seconds. 
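
To make the running requirement concrete, the following Python sketch checks it offline against a recorded trace. It is only an illustration of the intended meaning, not code produced by the toolchain: it assumes one sample per second, and the function name and trace layout are hypothetical.

```python
def missed_deadlines(trace, threshold=100.0, deadline=10):
    """Return the indices of triggers whose response comes too late.

    trace is a list of (in_flight, airspeed) pairs, assumed one per second.
    A trigger is an in-flight sample with airspeed below the threshold where
    either flight mode has just started or the airspeed was not below the
    threshold at the previous sample.
    """
    missed = []
    for i, (flying, speed) in enumerate(trace):
        if not (flying and speed < threshold):
            continue
        first_in_mode = i == 0 or not trace[i - 1][0]
        just_dropped = i > 0 and not trace[i - 1][1] < threshold
        if not (first_in_mode or just_dropped):
            continue
        for j in range(i, i + deadline + 1):
            if j >= len(trace) or not trace[j][0] or trace[j][1] >= threshold:
                break  # trace ended, flight mode ended, or airspeed recovered
        else:
            missed.append(i)
    return missed
```

A trace that stays at 90 m/s for twelve seconds in flight mode yields one missed deadline (the trigger at index 0), whereas a trace that recovers to 120 m/s after five seconds yields none.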


3 FRET Steps 


Next we discuss FRET, the requirements tool that constitutes our frontend. 


Step 0: FRETISH and semantic nuances. A FRETISH requirement (see running 
example in Figure 2) contains up to six fields: scope, condition, component*, 
shall*, timing, and response*. Fields marked with * are mandatory. 
component specifies the component that the requirement refers to (e.g., 
aircraft). shall expresses that the component’s behavior must conform to the 
requirement. response is of the form satisfy R, where R is a Boolean condition 
(e.g., satisfy airspeed >= 100). scope specifies the period when the requirement 
holds during the execution of the system, e.g., when “in flight mode”. condition 
is a Boolean expression that further constrains when the response shall occur 
(e.g., the requirement becomes relevant only upon airspeed < 100 becoming true). 
timing specifies when the response must occur (e.g., within 10 seconds). 
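
The field structure can be pictured with a small Python record. This is a sketch for illustration only: the class and method names are hypothetical, the actual FRETISH grammar is considerably richer, and the rendering order simply follows the running example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FretishRequirement:
    component: str                    # mandatory
    response: str                     # mandatory: the condition R in "satisfy R"
    scope: Optional[str] = None       # optional, e.g. "in flight mode"
    condition: Optional[str] = None   # optional, e.g. "airspeed < 100"
    timing: Optional[str] = None      # optional, e.g. "within 10 seconds"

    def render(self) -> str:
        """Assemble the sentence; the mandatory keyword 'shall' is always present."""
        parts = [
            self.scope,
            f"if {self.condition}" if self.condition else None,
            f"the {self.component} shall",
            self.timing,
            f"satisfy ({self.response})",
        ]
        return " ".join(p for p in parts if p)


running_example = FretishRequirement(
    component="aircraft",
    response="airspeed >= 100",
    scope="in flight mode",
    condition="airspeed < 100",
    timing="within 10 seconds",
)
```

Rendering running_example reproduces the FRETISH sentence of the running example, and omitting the three optional fields leaves a requirement that must hold at all times.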


Getting a temporal requirement right is usually a tricky task since such 
requirements are often riddled with semantic subtleties. To help the user, FRET 
provides a simulator and semantic explanations [17]. For example, the diagram 
in Figure 3 explains that the requirement is only relevant within the grayed 
box M (while in flight mode). TC represents the triggering condition 
(airspeed < 100) and the orange band, with a duration of n=10 seconds, states 
that the response (airspeed >= 100) is required to hold at least once within the 
10 seconds duration, assuming that flight mode holds for at least 10 seconds. 


Fig. 3: FRET explanations. TRIGGER: first point in the interval if 
(airspeed < 100) is true, and any point in the interval where (airspeed < 100) 
becomes true (from false). REQUIRES: for every trigger, RES must hold at some 
point with distance <= 10 from the trigger. M = flight, TC = (airspeed < 100), 
n = 10, Response = (airspeed >= 100). 


Step 1: FRETISH to pmLTL. For each FRETISH requirement, FRET generates 
formulas in a variety of formalisms. For the COPILOT integration, we use the 
generated pmLTL formulas (Figure 2). Clearly, manually writing such formulas 
can be quite error-prone, while the FRET formalization process has been 
extensively tested through its formalization verifier [17]. 
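
To illustrate how the past-time operators appearing in Figure 2 can be read, the following Python sketch evaluates them over a finite Boolean trace, one value per sample. It is a didactic approximation under the usual past-time semantics and not FRET's or COPILOT's implementation; the function names are chosen here for readability.

```python
def prev(xs):
    """Y: value at the preceding sample (False at the first sample)."""
    return ([False] + xs[:-1])[:len(xs)]


def historically(xs):
    """H: xs has held at every sample so far."""
    out, acc = [], True
    for x in xs:
        acc = acc and x
        out.append(acc)
    return out


def once(xs, within=None):
    """O: xs held at some sample so far; with `within`, during the last
    `within` steps (current sample included)."""
    return [any(xs[(0 if within is None else max(0, i - within)):i + 1])
            for i in range(len(xs))]


def since(xs, ys):
    """xs S ys: ys held at some sample and xs has held at every sample after it."""
    out, acc = [], False
    for x, y in zip(xs, ys):
        acc = y or (x and acc)
        out.append(acc)
    return out


def first_last_in_mode(mode):
    """Fin_mode and Lin_mode as defined in Figure 2, with FTP = !(Y true)."""
    ftp = [not v for v in prev([True] * len(mode))]
    not_mode = [not m for m in mode]
    fin = [m and (f or p) for m, f, p in zip(mode, ftp, prev(not_mode))]
    lin = [(not m) and p for m, p in zip(mode, prev(mode))]
    return fin, lin
```

For a flight signal that is false, true, true, false, first_last_in_mode marks the second sample as the first point in flight mode and the fourth sample as the last point, which is the bookkeeping the generated formula relies on to delimit the scope.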

Steps 2 & 3: Variables data and Component Specification. 
We extended FRET’s analysis portal to capture the information needed to 
generate Component Specifications for OGMA. To generate a specification, the 
user must indicate the type (i.e., input, output, internal) and data type (integer, 
Boolean, double, etc.) of each variable (Figure 4). Internal variables represent 
expressions of input and output variables; if the same expression is used in multiple 
requirements, an internal variable can be used to substitute it and simplify the 
requirements. The user must assign an expression to each internal variable. In 
our example, the flight internal variable is defined by the expression altitude 
> 0.0, where altitude is an input variable. Internal variable assignments can 
be defined in Lustre or Copilot [29]. Integrated Lustre and Copilot parsers 
identify parsing errors and return feedback (Figure 4). Once steps 1 and 2 are 
completed, FRET generates a Component Specification, which contains all 
requirements in pmLTL and Lustre code, as well as variable data that belong to 
the same system component. 


Fig. 4: FRET variable editor 


4 Ogma Steps 


OGMA is a command-line tool to produce monitoring applications. OGMA 
generates monitors in COPILOT, and also supports integrating them into larger 
systems, such as applications built with NASA’s core Flight System (cFS) [40]. 

Step 4: Copilot Monitors. OGMA provides a command fret-component-spec 
to process Component Specifications. The command traverses the Abstract 
Syntax Tree of the Component Specification, and converts each tree node into its 
COPILOT counterpart. Input and output variables in FRET become extern streams 
in COPILOT, or time-varying sources of information needed by the monitors: 
airspeed :: Stream Double 

airspeed = extern "airspeed" Nothing 


Internal variables are also mapped to streams. Each requirement’s pmLTL formula 
is translated into a Boolean stream, paired with a C handler triggered when 
the requirement is violated. In the example below, the property we monitor is 
associated with a handler, handlerpropAvoidStall, which must be implemented 
separately in C by the user to determine how to address property violations: 
propAvoidStall :: Stream Bool 

propAvoidStall = ((PTLTL.alwaysBeen ((((not (flight)) && ... ))))) 

spec = trigger "handlerpropAvoidStall" (not propAvoidStall) [] 
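
The node-by-node conversion can be sketched as a small recursive translator. The Python code below is an illustration and not OGMA's implementation: the tuple-based tree encoding is hypothetical, PTLTL.alwaysBeen appears in the listing above, and the names PTLTL.previous, PTLTL.eventuallyPrev, and PTLTL.since (including its argument order) are assumed to be the corresponding functions of COPILOT's past-time library.

```python
# Assumed mapping from pmLTL operators to Copilot function names.
UNARY = {
    "H": "PTLTL.alwaysBeen",
    "O": "PTLTL.eventuallyPrev",
    "Y": "PTLTL.previous",
    "not": "not",
}
INFIX = {"and": "&&", "or": "||"}


def to_copilot(node):
    """Translate a formula tree into Copilot-style source text.

    A node is ("var", name), (unary_op, child), or (binary_op, left, right).
    """
    op = node[0]
    if op == "var":
        return node[1]
    if op in UNARY:
        return f"({UNARY[op]} ({to_copilot(node[1])}))"
    if op in INFIX:
        return f"({to_copilot(node[1])} {INFIX[op]} {to_copilot(node[2])})"
    if op == "S":
        return f"(PTLTL.since ({to_copilot(node[1])}) ({to_copilot(node[2])}))"
    raise ValueError(f"unsupported node: {op}")
```

For instance, the tree ("H", ("not", ("var", "flight"))) is rendered as (PTLTL.alwaysBeen ((not (flight)))), which has the same shape as the generated property shown above.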


5 Copilot Steps 


COPILOT is a stream-based runtime monitoring language. COPILOT streams may 
contain data of different types. At the top level, specifications consist of pairs of 
Boolean streams, together with a C handler to be called when the current sample 
of a stream becomes true. For a detailed introduction to COPILOT, see [29]. 
Step 5: C Monitors. OGMA generates self-contained COPILOT monitoring 
specifications, which can be further compiled into C99 by just compiling and 
running the COPILOT specifications with a Haskell compiler. This process produces 
two files: a C header and a C implementation. 

Step 6: Larger Applications. The C files generated by COPILOT are designed 
to be integrated into larger applications. They provide three connection end- 
points: extern variables, a step function, and handler functions, which users 
implement to handle property violations. The generated code has no dynamic 
memory allocation, loops, or recursive calls, so it executes in predictable memory 
and time. For our running example, the header file generated by COPILOT declares: 
extern bool flight; extern float airspeed; 

void handlerpropAvoidStall(void); void step(void); 

Commonly, the calling application will poll sensors, write their values to 
global variables, call the step function, and implement handlers that log property 
violations or execute corrective actions. Users are responsible for compiling and 
linking the COPILOT code together with their application (step 7). 
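
The integration pattern just described can be mimicked in a few lines of Python. This is only an analogue of the C interface: the class StallMonitor, its fields, and its simplified internal logic (counting consecutive in-flight samples below the threshold) are assumptions made for illustration, whereas real monitors are the C files generated by COPILOT.

```python
class StallMonitor:
    """Python stand-in for a generated monitor (hypothetical, simplified)."""

    def __init__(self, handler, threshold=100.0, deadline=10):
        self.flight = False      # plays the role of `extern bool flight`
        self.airspeed = 0.0      # plays the role of `extern float airspeed`
        self._handler = handler  # plays the role of the C handler function
        self._threshold = threshold
        self._deadline = deadline
        self._below = 0          # consecutive in-flight samples below threshold

    def step(self):
        """Process one sample and call the handler while the property is violated."""
        if self.flight and self.airspeed < self._threshold:
            self._below += 1
        else:
            self._below = 0
        if self._below > self._deadline:
            self._handler()


def run(samples, monitor):
    for flight, airspeed in samples:   # 1. poll the sensors
        monitor.flight = flight        # 2. publish the values to the monitor
        monitor.airspeed = airspeed
        monitor.step()                 # 3. advance the monitor by one sample
```

With a handler that appends to a log, feeding twelve in-flight samples at 90 m/s calls the handler on the eleventh and twelfth samples; a real application would instead log the violation or execute a corrective action.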

We also used the running requirement in this paper to monitor a flight in 
the simulator X-Plane. We wrote an X-Plane plugin to show the state of the C 
monitor and some additional information on the screen (Fig. 5a). To test the code, 
we brought an aircraft to a stall by increasing the angle of attack, which also 
lowered the airspeed (Fig. 5b). After 10 seconds below the specified threshold, the 
monitor became active, remaining on after executing a stall recovery (Fig. 5c). 


Fig. 5: Demonstration of COPILOT monitor running as X-Plane plugin: 
(a) Cruising, (b) Stall, (c) Recovery. 


6 Preliminary Results 


We report on experiments with monitors generated from the publicly available 
Lockheed Martin Cyber-Physical System (LMCPS) challenge problems [12], 
which are a set of industrial Simulink model benchmarks and natural language 
requirements developed by domain experts. LMCPS requirements were previously 
written in FRETISH by a subset of the authors and were analyzed against 
the provided models using model checking. 

In this paper, we reuse the FRETISH requirements to generate monitors 
and compare our runtime verification results with the model checking results 
of [26]. For each Simulink model we generated C code through the automatic 
code generation feature of Matlab/Simulink. We then attached the generated C 
monitors to the C code and used the property-based testing system QuickCheck [o] 
to generate random streams of data, feed them to the system under observation, 
and report if any of the monitors were activated, based on [34]. 
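
The testing loop can be pictured as follows. This Python sketch is a much reduced stand-in for a QuickCheck-based harness (it performs no shrinking of counterexamples), and the function name, the input range, and the example systems are hypothetical.

```python
import random


def find_counterexample(system, monitor_ok, runs=100, length=50, seed=0):
    """Feed random input streams to `system` and return the first stream for
    which `monitor_ok(inputs, outputs)` is False, or None if all runs pass."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    for _ in range(runs):
        inputs = [rng.uniform(0.0, 200.0) for _ in range(length)]
        outputs = system(inputs)
        if not monitor_ok(inputs, outputs):
            return inputs
    return None


def never_above_150(inputs, outputs):
    """Example property: the output never exceeds 150."""
    return all(o <= 150.0 for o in outputs)


def saturating(xs):
    """A system that satisfies the property by clamping its input."""
    return [min(x, 150.0) for x in xs]


def passthrough(xs):
    """A system that violates the property by forwarding its input."""
    return list(xs)
```

Here find_counterexample(saturating, never_above_150) reports no violation, while the pass-through system is caught as soon as a random stream contains a value above 150.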

We experimented with the Finite State Machine (FSM) and the Control 
Loop Regulators (REG) LMCPS challenges. For both challenges, our results 
are consistent with the model checking results: QuickCheck found inputs that 
activated the monitors, indicating that some requirements were not satisfied. 
Moreover, it returned results within seconds in cases where model checkers timed 
out. See for details on the results and for a reproducible artifact. 


7 Conclusion 


We described an end-to-end framework in which requirements written in struc- 
tured natural language can be equivalently transformed into monitors and be 
analyzed against C code. Our framework ensures that requirements and analysis 
activities are fully aligned: C monitors are derived directly from requirements and 
not handcrafted. The design of our toolchain facilitates extension with additional 
front-ends (e.g., JKind Lustre [15]) and backends (e.g., R2U2 [38]). In the future, 
we plan to explore more use cases, including some from real drone test flights. 

Abstract. We present MaskD, an automated tool designed to measure 
the level of fault-tolerance provided by software components. The tool 
focuses on measuring masking fault-tolerance, that is, the kind of fault- 
tolerance that allows systems to mask faults in such a way that they 
cannot be observed by the users. The tool takes as input a nominal model 
(which serves as a specification) and its fault-tolerant implementation, 
described by means of a guarded-command language, and automatically 
computes the masking distance between them. This value can be under- 
stood as the level of fault-tolerance provided by the implementation. The 
tool is based on a sound and complete framework we have introduced in 
previous work. We present the ideas behind the tool by means of a sim- 
ple example and report experiments realized on more complex case studies. 


1 Introduction 


Fault-tolerance is an important characteristic of critical software. It can be 
defined as the capability of systems to deal with unexpected events, which may 
be caused by code bugs, interaction with an uncooperative environment, hardware 
malfunctions, etc. Examples of fault-tolerant systems can be found everywhere: 
communication protocols, hardware circuits, avionic systems, cryptocurrencies, 
etc. So, the increasing relevance of critical software in everyday life has led to a 
renewed interest in the automatic verification of fault-tolerant properties. However, 
one of the main difficulties when reasoning about these kinds of properties is given 
by their quantitative nature, which is true even for non-probabilistic systems. A 
simple example is given by the introduction of redundancy in critical systems. 
This is, by far, one of the most used techniques in fault-tolerance. In practice, it 
is well known that adding more redundancy to a system increases its reliability. 
Measuring this increment is a central issue for evaluating fault-tolerant software. 
On the other hand, there is no de-facto way to formally characterize fault-tolerant 
properties. Thus these properties are usually encoded using ad-hoc mechanisms 
as part of a general design. 

The usual flow for the design and verification of fault-tolerant systems consists 
of defining a nominal model (i.e., the “fault-free” or “ideal” program) and 
afterwards extending it with faulty behaviors that deviate from the normal 
behavior prescribed by the nominal model. This extended model represents the 
way in which the system operates under the occurrence of faults. More specifically, 
a model extension enriches a transition system by adding new (faulty) states and 
transitions from and to those states; the resulting model is the fault-tolerant 
implementation. 

On the other hand, during the last decade, significant progress has been made 
towards defining suitable metrics or distances for diverse types of quantitative 
models including real-time systems [11], probabilistic models [7], and metrics for 
linear and branching systems [5,2,10,13,19]. Some authors have already pointed 
out that these metrics can be useful to reason about the robustness of a system, 
a notion related to fault-tolerance. 

We present MaskD, an automated tool designed to measure the level of 
fault-tolerance among software components, described by means of a guarded-command 
language. The tool focuses on measuring masking fault-tolerant components, that 
is, programs that mask faults in such a way that they cannot be observed by the 
environment. Masking is often classified as the most beneficial kind of 
fault-tolerance and it is a highly desirable property for critical systems. The tool takes as 
input a nominal model and its fault-tolerant implementation and automatically 
computes the masking distance between them. It is based on a framework we 
have introduced in [4], and shown to be sound and complete. In Section 2 we 
give a brief introduction to this framework. 

The tool is well suited to support engineers for the analysis and design of 
fault-tolerant systems. More precisely, it uses a computable masking distance 
function such that an engineer can measure the masking tolerance of a given 
fault-tolerant implementation, i.e., the number of faults that the implementation 
is able to mask in the worst case. Thereby, the engineers can measure and compare 
the masking fault-tolerance distance of alternative fault-tolerant implementations, 
and select one that best fits their preferences. 


2 The MaskD Tool 


MaskD takes as input a nominal model and its fault-tolerant implementation, 
and produces as output the masking distance between them, which is a value in 
the interval [0,1]. The input models are described using the guarded command 
language introduced in [3], a simple programming language common for describing 
fault-tolerant algorithms. More precisely, a program is a collection of processes, 
where each process is composed of a collection of labelled actions of the style: 
[Label] Guard -> Command, where Guard is a Boolean condition over the current 
state of the program, Command is a collection of basic assignments, and Label 
is a name for the action. These syntactic constructions are called actions. The 
language also allows users to label an action as internal (i.e., silent actions). This 
is important for abstracting away internal parts of the system and building large 
models. Moreover, some actions can be labeled as faulty to indicate that they 
represent faults. 


Process MEMORY {
  w : BOOL; // the last value written
  r : BOOL; // the value read from the memory
  c0 : BOOL;
  Initial: w && c0 && r;
  [write1] true -> w=true, c0=true, r=true;
  [write0] true -> w=false, c0=false, r=false;
  [read0] !r -> r=r;
  [read1] r -> r=r;
}

Process MEMORY_FT {
  w : BOOL;
  r : BOOL;
  c0 : BOOL; // first bit
  c1 : BOOL; // second bit
  c2 : BOOL; // third bit
  Initial: w && c0 && c1 && c2 && r;
  [write1] true -> w=true, c0=true, c1=true, c2=true, r=true;
  [write0] true -> w=false, c0=false, c1=false, c2=false, r=false;
  [read0] !r -> r=r;
  [read1] r -> r=r;
  [fail1] faulty true -> c0=!c0, r=(!c0&&c1) || (c1&&c2) || (!c0&&c2);
  [fail2] faulty true -> c1=!c1, r=(c0&&!c1) || (!c1&&c2) || (c0&&c2);
  [fail3] faulty true -> c2=!c2, r=(c0&&c1) || (c1&&!c2) || (c0&&!c2);
}


Fig. 1. Processes for a memory cell example. The first process is the Nominal 
Model and the second is the Fault-tolerant Model. 


In order to compute the masking distance between two systems the tool uses 
notions coming from game theory. More precisely, a two-player game (played by 
the Refuter (R) and the Verifier (V)) is constructed using the two models. The 
intuition of this game is as follows. The Refuter chooses transitions of either the 
specification or the implementation to play, and the Verifier tries to match her 
choice. However, when the Refuter chooses a fault, the Verifier must match it 
with a masking transition. R wins if the game reaches the error state, denoted 
v_err. On the other hand, V wins when v_err is not reached during the game. 
Rewards are added to certain transitions in the game to reflect the fact that a 
fault was masked. Thus, given a play (a maximal path in the game graph) a 
function f_mask computes the value of the play: if it reaches the error state, the 
value is inversely proportional to the number of masking movements made by 
the Verifier; if the play is infinite, it receives a value of 0 indicating that the 
implementation was able to mask all the faults in the path. Summing up, the 
fault-tolerant implementation is masking fault-tolerant if the value of the game is 
0. Furthermore, the bigger the number, the farther the masking distance between 
the fault-tolerant implementation and the specification. 
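
As a worked illustration of the value of a play, the sketch below assumes that a play reaching the error state after m masking moves is valued 1/(1 + m). This concrete formula is an assumption made here; it is consistent with the description above and with the values reported in the experiments (for example 0.333 when two faults are masked), but the precise definition is the one given in [4].

```python
from fractions import Fraction


def play_value(reaches_error, masked_faults):
    """Value of a play under the assumed formula 1 / (1 + masked_faults).

    Infinite plays never reach the error state and are valued 0.
    """
    if not reaches_error:
        return Fraction(0)
    return Fraction(1, 1 + masked_faults)
```

Under this assumption a play that fails after masking two faults has value 1/3, a play that fails after masking four faults has value 1/5, and any play that never reaches the error state has value 0, so more masked faults mean a smaller distance.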


As a running example, we consider a memory cell that stores a bit of 
information and supports reading and writing operations, presented in a state-based 
form in [6]. A state in this system maintains the current value of the memory cell 
(m = i, for i = 0,1), writing allows one to change this value, and reading returns 
the stored value. In this system the result of a reading depends on the value 
stored in the cell. Thus, a property that one might associate with this model is 
that the value read from the cell coincides with that of the last writing performed 
in the system. 


Fig. 2. Architecture of MaskD. 

A potential fault in this scenario occurs when a cell unexpectedly loses its 
charge, and its stored value turns into another one (e.g., it changes from 1 to 0 
due to charge loss). A typical technique to deal with this situation is redundancy: 
in this case, three memory bits are used instead of only one. Writing operations 
are performed simultaneously on the three bits. Reading, on the other hand, 
returns the value that is repeated at least twice in the memory bits; this is 
known as voting. Figure 1 shows the processes representing the nominal and the 
fault-tolerant implementation of this example. 
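
The voting scheme can be tried out with a few lines of Python. This is a sketch of the idea only, separate from the guarded-command models of Figure 1; the class and method names are hypothetical.

```python
def majority(bits):
    """Return the value held by more than half of the bits."""
    return sum(bits) * 2 > len(bits)


class RedundantCell:
    """Memory cell that stores one logical bit in several physical copies."""

    def __init__(self, copies=3, value=True):
        self.written = value           # last value written (the nominal view)
        self.bits = [value] * copies   # redundant copies

    def write(self, value):
        self.written = value
        self.bits = [value] * len(self.bits)

    def flip(self, index):
        """Fault: one copy unexpectedly changes its value."""
        self.bits[index] = not self.bits[index]

    def read(self):
        return majority(self.bits)
```

After writing 0, a single flipped bit is masked because the majority still reads 0, whereas a second flipped bit makes the majority read 1, which no longer matches the last written value.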


2.1 Architecture 


MaskD is open source software written in Java. Documentation and installation 
instructions can be found at [1]. The architecture of the tool is shown in Fig. 2. 
We briefly describe below the key components of the tool: 


Parser Module. It performs basic syntactic analysis over the input models, 
and produces data structures describing the inputs. Libraries Cup and JFlex 
were used to automatically generate the parser from the grammar describing 
the modeling language. 

LTS Translation. The models obtained from the parser are translated into 
Labeled Transition Systems (LTSs), i.e., graphs whose vertices represent 
program states and whose transitions keep information about the actions in 
the models. 

Silent Transition Saturation. The internal/silent transitions in the LTSs 
representing the input models are saturated using standard algorithms coming 
from process algebras [14]. As a result, saturated LTSs are generated; these 
are needed for verifying the masking relation when internal transitions are 
present. 

Game Graph Generation. It uses the saturated LTSs to produce a game 
graph. Nodes in this graph encode the current configuration of the game: the 
next player to play, the last played action, and references to the LTS states 
corresponding to the current configuration of the game. Transitions in this 
graph correspond to the possible plays for the players, i.e., transitions in the 
original LTSs. 

0. ERR_STATE
1. { <> , Im1.read1 , <m1r,m1c0,m1c2> , V }
2. { <> , # , <m1r,m1c0,m1c2> , R }
3. { <> , Im1.fail1 , <m1r,m1c0,m1c2> , V }
4. { <> , # , <m1c2> , R }
5. { <> , Im1.fail3 , <m1c2> , V }
6. { <> , # , <> , R }
7. { <m1w,m1r,m1c0> , Im1.write0 , <> , V }
8. { <m1w,m1r,m1c0> , # , <m1w,m1r,m1c0,m1c1,m1c2> , R }


Fig. 3. Error trace for the memory cell example. 


Shortest Path Algorithm. If the input models are deterministic, Dial’s short- 
est path algorithm is used to get the shortest path to the error state, from 
which the final value is calculated. 

Fix-Point Algorithm. If the input models are non-deterministic, a bottom-up 
breadth-first search is used to compute the value of the game. This algorithm 
is based on well-known algorithms to solve reachability games that use 
attractor sets [15]. 
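
For readers unfamiliar with attractor sets, the following Python sketch computes the set of game nodes from which the Refuter can force the play into a set of target nodes. It only decides who wins a reachability game and does not compute the quantitative value that MaskD reports; the graph encoding and the function name are assumptions made for illustration.

```python
def refuter_attractor(edges, owner, target):
    """Nodes from which player "R" can force reaching `target`.

    edges: dict node -> list of distinct successors (every node is a key)
    owner: dict node -> "R" or "V", the player who moves at that node
    """
    preds = {n: [] for n in edges}
    remaining = {n: len(succs) for n, succs in edges.items()}
    for n, succs in edges.items():
        for m in succs:
            preds[m].append(n)
    attr = set(target)
    frontier = list(target)
    while frontier:
        m = frontier.pop()
        for n in preds[m]:
            if n in attr:
                continue
            if owner[n] == "R":
                attr.add(n)            # R needs only one move into the set
                frontier.append(n)
            else:
                remaining[n] -= 1      # V loses once all its moves lead inside
                if remaining[n] == 0:
                    attr.add(n)
                    frontier.append(n)
    return attr


# Hypothetical game: from v1 the Verifier can escape to the safe loop r3.
EDGES = {"r0": ["v1", "v2"], "v1": ["err", "r3"], "v2": ["err"],
         "r3": ["r3"], "err": []}
OWNER = {"r0": "R", "v1": "V", "v2": "V", "r3": "R", "err": "R"}
```

In this example the Refuter wins from r0 by moving to v2, where the Verifier has no alternative to the error state, while v1 and r3 stay outside the attractor.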


As explained above, an interesting point about our implementation is that, 
for deterministic systems, the masking distance between two systems can be 
computed by resorting to Dial’s shortest path algorithm [17], which runs in linear 
time with respect to the size of the graphs used to represent the systems. In 
the case of non-deterministic systems, a fixpoint traversal approach based on 
breadth-first search is needed, making the algorithm less efficient. However, even 
in this case, the algorithm is polynomial. 
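
Dial's algorithm is a bucket-based variant of Dijkstra's algorithm for small non-negative integer edge weights. The Python sketch below shows the technique on a generic graph; how MaskD assigns weights to game transitions is not detailed here, so the example graph and its weights are hypothetical.

```python
def dial_shortest_paths(graph, source, max_weight):
    """Shortest distances from `source` using one bucket per distance value.

    graph: dict node -> list of (neighbor, weight), with integer weights in
    0..max_weight and every node present as a key.
    """
    dist = {source: 0}
    # A shortest path has at most len(graph) - 1 edges.
    buckets = [[] for _ in range(max_weight * max(len(graph) - 1, 1) + 1)]
    buckets[0].append(source)
    for d, bucket in enumerate(buckets):
        while bucket:                  # zero-weight edges may refill the bucket
            u = bucket.pop()
            if dist[u] != d:
                continue               # stale entry: a shorter path was found
            for v, w in graph[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    buckets[d + w].append(v)
    return dist


GRAPH = {
    "a": [("b", 1), ("c", 0)],
    "b": [("err", 1)],
    "c": [("b", 0), ("err", 3)],
    "err": [],
}
```

Each node is settled when its bucket is scanned and each edge is relaxed a constant number of times, so the running time is linear in the size of the graph plus the number of buckets.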


2.2 Usage 


The standard command to execute MaskD in a Unix operating system is: 
./MaskD <options> <spec_path> <imp_path> 


In this case the tool returns the masking distance between the specification and 
the implementation. Possible optional commands are: -t: print error trace, 
prints a trace to the error state; and -s: start simulation, starts a simu- 
lation from the initial state. A path to the error state is a useful feature for 
debugging program descriptions, which may be failing for unintended reasons. 
A trace for the memory cell example is shown in Fig. 3. States are denoted as 
{spec_state, last_action_played, imp_state, player_turn}. In this case, 
after two faults (bits being flipped), performing a read on the cell leads to the 
error state since on the nominal model the value is 0 while on the fault-tolerant 
model the value read by majority is 1. On the other hand, the simulation feature 
allows the user to manually select the available actions at each point of the mask- 
ing game, which is also useful for verifying that the models behave as intended. 
By default, MaskD computes the masking distance for the given input using the 
algorithm for non-deterministic systems. The user can use option -det to switch 
to the deterministic masking distance algorithm. 

Case Study              Redundancy       M. Distance   Time     Time(Det)
Redundant Memory Cell   3 bits           0.333         0.7s     0.6s
                        5 bits           0.25          2.5s     1.9s
                        7 bits           0.2           7.2s     5.7s
N-Modular Redundancy    3 modules        0.333         0.6s     0.5s
                        5 modules        0.25          1.2s     0.7s
                        7 modules        0.2           5.6s     3.8s
Dining Philosophers     2 philosophers   0.5           0.6s     0.6s
                        3 philosophers   0.333         1.9s     0.9s
Byzantine Generals      3 generals       0.5           0.9s     -
                        4 generals       0.333         17.1s    -
Raft                    1 follower       0             0.7s     0.8s
                        2 followers      0             5.6s     3.6s
BRP (5)                 1 retransm.      0.333         4.2s     -
                        5 retransm.      0.143         4.8s     -
                        10 retransm.     0.083         6.1s     -


Table 1. Some results of the masking distance for the case studies. 


3 Experiments 


Table 1 reports the masking distance for multiple instances of
several case studies. These are: a Redundant Memory Cell (our running example),
N-Modular Redundancy (a standard example of a fault-tolerant system [18]), a
variation of the Dining Philosophers problem [8], the Byzantine Generals problem
introduced by Lamport et al. [12], the Log Replication consistency check of Raft
[16], and the Bounded Retransmission Protocol (a well-known example of a
fault-tolerant protocol [9]), which we have modeled using silent actions and
evaluated with the weak masking distance. All case studies have been evaluated using
the algorithms for both deterministic and non-deterministic games, with the
exception of the non-deterministic models (i.e., the Byzantine Generals problem
and the Bounded Retransmission Protocol), for which only the non-deterministic
algorithm applies. It is worth noting that most of the computational cost arises
from building the game graph rather than from the actual masking distance
calculation. For space reasons, we omit details of
each case study and its complete experimental evaluation (delegated to the tool 
documentation). 

Some words are useful to interpret the results of our running example. For 
the case of a 3-bit memory, the masking distance is 0.333; the main reason for
this is that the faulty model (in the worst case) is only able to mask 2 faults (in 
this example, a fault is an unexpected change of a bit) before failing to replicate 
the nominal behaviour (i.e., reading the majority value). Thus, the result comes 
from the definition of masking distance and taking into account the occurrence 
of two faults. The situation is similar for the other instances of this problem with 
more redundancy. 
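The arithmetic behind these values can be sketched as follows. The closed form 1/(1 + k), where k is the number of faults on the worst-case play before the error state, is an assumption made here because it reproduces the distances in Table 1; this section does not state the definition itself. The helper names are hypothetical.

```python
def faults_to_break_majority(n_bits: int) -> int:
    """Smallest number of bit flips that changes the majority value of an odd number of bits."""
    return n_bits // 2 + 1


def masking_distance(faults_before_error: int) -> float:
    # Assumed closed form (not taken from this section): 1 / (1 + number of faults).
    return 1.0 / (1 + faults_before_error)


for n in (3, 5, 7):
    k = faults_to_break_majority(n)
    print(f"{n} bits: {k} faults, distance {masking_distance(k):.3f}")
```

Under this assumption, the 3-, 5-, and 7-bit instances need 2, 3, and 4 faults, which gives 0.333, 0.25, and 0.2.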

We have run our experiments on a MacBook Air with a 1.3 GHz Intel Core 
i5 processor and 4 GB of memory. The case studies for reproducing the results 
are available in the tool repository. 
Abstract. Dafny is a verification-aware programming language used at
Amazon Web Services to develop critical components of their access man- 
agement, storage, and cryptography infrastructures. The Dafny toolchain 
provides a verifier that can prove an implementation of a method satis- 
fies its specification. When the underlying SMT solver cannot establish 
a proof, it generates a counterexample. These counterexamples are hard 
to understand and their interpretation is often a bottleneck in the proof 
debugging process. In this paper, we introduce an open-source tool that 
transforms counterexamples generated by the SMT solver to a more user- 
friendly format that maps to the Dafny syntax and is suitable for further 
processing. This new tool allows the Dafny developers to quickly identify 
the root cause of a problem with their proof, thereby speeding up the 
development of Dafny projects. 


Keywords: Dafny · Counterexample · Verification · SMT


1 Introduction 


Dafny [12,11,6] is a verification-aware programming language popular in the 
automated reasoning community. Amazon Web Services (AWS), in particular, 
uses Dafny to develop critical components of their access management, storage, 
and cryptography infrastructures [5]. For these components, developers at AWS 
are writing Dafny programs that include the specification and the corresponding 
implementation. The advantage of using Dafny is that one can leverage the 
built-in verifier during the development process to automatically prove that the 
implementation of a method satisfies its specification. Finally, Dafny provides 
compilers for generating executable code in different target languages, such as 
C#, Java, and Go. For example, AWS developers have implemented the core 
AWS authorization logic in Dafny, and generated production Java code using a 
custom Java compiler. However, despite its advantages, Dafny has so far lacked
debugging functionality that could guide the developer to the root cause of
a potential assertion (i.e., proof) failure. This was slowing down the developers,
and it prompted the work on counterexample extraction that we present in this 
paper. 
To confirm that an assertion holds, the Dafny verifier first translates Dafny source
into the Boogie [1,3] intermediate verification language. Boogie generates a
verification condition and submits it to an SMT solver (in our case Z3 [13,15]).
When an assertion is violated, the solver provides a counterexample (i.e., a
counterexample model). Understanding such counterexamples is key to debugging a
failing proof. However, due to the two translation steps separating Dafny code 
from the SMT query, the counterexamples provided by the solver are difficult 
to understand and inhibit the debugging process. The scope of the problem be- 
comes apparent from the fact that a counterexample extraction tool was once 
developed for Boogie [10], a language that is much closer to the solver in the 
verification pipeline than Dafny. 

Prior attempts to present Dafny counterexamples in a human-readable for- 
mat [9,8] have been successful with integers and Booleans but yielded unsatis- 
fying results for other types. Our main contribution is a tool that improves the 
readability of Dafny counterexamples for other basic types, user-defined types, 
and collections. The tool converts a counterexample generated by the solver to 
a format that is intuitive to Dafny developers. In addition to improving the user 
experience, our tool lays the foundation for automatic test case generation, as 
we discuss in Section 4. 


2 Motivation 


Fig. 1 shows our running example of a Dafny program. The program defines 
a class Simple with an instance method Match that returns true if argument s
(of type string, which is an alias for seq<char>) matches the pattern p. For
simplicity, we only allow the '?' meta-character in the pattern, which matches any
character. The program also includes specifications in the form of preconditions, 
postconditions, and loop invariants. The Dafny verifier uses these to prove the 
correctness of the method implementation. 

To demonstrate the usefulness of counterexamples and the need to present 
them in a human-readable format, we introduce a bug into the Match method. 
We do this by deleting the part of the guard highlighted on line 16, thereby 
turning the method into a string equality check. The implementation of the 
method and its specification are no longer in agreement, and the Dafny verifier 
reports that the postcondition on line 7 might be violated on line 18. Even in 
this simple case, the information that the verifier gives, although it might help 
in localizing the problem, does not make the cause of the bug apparent. The 
counterexample provided by the solver spans hundreds of lines and is difficult to 
read. For example, Fig. 2 gives a slice of this counterexample showing just that 
variable s has type seq<char>. 

In contrast, our tool, released with Dafny v3.3.0, generates the following 
counterexample that triggers the postcondition violation: 


s:seq<char> = (Length := 1, [0] := 'A'); 
this:Simple = (p: @1); 
@1:seq<char> = (Length := 1, [0] := '?'); 
class Simple {

  var p:string

  method Match(s: string) returns (b: bool)
    requires |p| == |s|
    ensures b <==> forall n :: 0 <= n < |s| ==>
      s[n] == p[n] || p[n] == '?'
  {
    var i := 0;
    while i < |s|
      invariant i <= |s|
      invariant forall n :: 0 <= n < i ==>
        s[n] == p[n] || p[n] == '?'
    {
      if s[i] != p[i] && p[i] != '?'
      {
        return false;
      }
      i := i + 1;
    }
    return true;
  }
}


Fig. 1: A Dafny program that matches a string against a pattern. The highlighted
code (the conjunct && p[i] != '?' in the guard of the if statement) is removed to
introduce a bug as described in Section 2.


Here, the first line indicates that argument s is a sequence of characters (i.e., a 
string) of length 1, where the character at index 0 is A. Field p of the receiving 
object (this) points to object @1, where @1 is a string of length 1 with the 
? meta-character at index 0. With these inputs, the buggy implementation of 
method Match returns false because the pattern and argument are not identical, 
even though they should match according to the specification. 
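The violation can be replayed outside the verifier. The sketch below is a Python transliteration of the specification and of the buggy method from Fig. 1, not Dafny code and not part of the tool; the function names are hypothetical.

```python
def spec_matches(s: str, p: str) -> bool:
    """Postcondition of Match: each position is equal, or the pattern holds '?' there."""
    return all(s[n] == p[n] or p[n] == "?" for n in range(len(s)))


def match_buggy(s: str, p: str) -> bool:
    """Match with the '?' conjunct removed from the guard, i.e., plain string equality."""
    assert len(p) == len(s)  # precondition |p| == |s|
    for i in range(len(s)):
        if s[i] != p[i]:
            return False
    return True


# Inputs taken from the extracted counterexample: s = "A", this.p = "?".
s, p = "A", "?"
print(match_buggy(s, p), spec_matches(s, p))  # the two values disagree
```

The buggy method returns false while the specification requires true, which is exactly the postcondition violation reported by the verifier.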


Before we incorporated our tool into Dafny, it would report the following 
counterexample for this same program: 


s = [Length 1]C(T@U!val!71); this = (T@U!val!75); 


Clearly the counterexample generated by our tool is much more informative. 
Among the tools in this space that we know of, only Why3 [7] has counterexample 
generation functionality of similar complexity. 
s#0 -> T@U!val!71            // Boogie variable s#0 has ID 71
BoxType -> T@T!val!15        // Boogie's Box type has ID 15
type -> {                    // The Boogie type of variable s#0 has ID 22
  T@U!val!71 -> T@T!val!22
}
SeqTypeInv0 -> {             // Boogie type of s#0 is Seq Box:
  T@T!val!22 -> T@T!val!15
}
$Is -> {                     // Dafny type of variable s#0 has ID 76
  T@U!val!71 T@U!val!76 -> true
}
Tag -> {                     // Type with ID 76 is a subtype of a type with ID 13
  T@U!val!76 -> T@U!val!13
}
TagSeq -> T@U!val!13         // Dafny type with ID 13 is seq
TChar -> T@U!val!1           // Dafny type with ID 1 is char
Inv0_TSeq -> {               // Dafny type with ID 76 is seq<char>
  T@U!val!76 -> T@U!val!1
}


Fig. 2: An extract of a counterexample model generated by Z3 for the code in 
Fig. 1 that shows that variable s has type seq<char>. 


3 Design and Implementation 


We implemented our tool on top of the existing Dafny counterexample extraction 
functionality by adding key new features such as the ability to extract types from 
the Z3 model and support complex types (e.g., sequences) beyond just integers 
and Booleans. Our type extraction supports type parameterization and type 
renaming, and makes extracted counterexamples useful beyond improved user 
experience, e.g., automatic test case generation (see Section 4). 

We illustrate how the counterexample generation tool works using our run- 
ning example from Fig. 1. Before the tool can look up the types and values of 
specific variables, it must first identify the variables and program states¹ relevant
to the given counterexample. In our example, there are four relevant program 
states: the initial state, the state following the initialization of i, the state at the 
loop head, and the state preceding the return statement. There are three rele- 
vant variables: this, s, and i. Our tool inherits the extraction of this information 
from the Z3 model from the existing counterexample generator. 

1 The Dafny-to-Boogie translator marks Dafny program states with the :capturedState
annotation in Boogie.


Variable           Constraint                 Counterexample
b:bv6              b == 1                     b:bv6 := 0
r:real             r != 0.2                   r:real := 1.0/5.0
c:char             c != 'c'                   c:char := 'c'
c:char             c == [illegible]           c:char := 'A'
d:M.DType          d.i > 4                    d:M.DType = A(i := -34)
a:array2?<int>     a.Length0 < 2 ||           a:_System.array2?<int> :=
                   a.Length1 < 2 ||           (Length0 := 2, Length1 := 40,
                   a[1,1] != 3                [1,1] := 3)
s:set<int>         1 in s                     s:set<int> = {1 := false}
s:set<int>         1 !in s                    s:set<int> = {1 := true}
s:seq<int>         |s| < 1 || s[0] != 3       s:seq<int> = [3]
s:seq<int>         |s| < 2 || s[1] != 3       s:seq<int> = (Length := 2,
                                              [1] := 3)
m:map<int,char>    1 !in m                    m:map<int,char> = (1 := 'A',
                                              [remainder illegible])


Table 1: Counterexamples generated for different constraints.


Once we identify the relevant variables and states, we determine the type of
each variable. This is a two-step process. First, we extract the Boogie type of
a variable in the Boogie translation from the Z3 model (e.g., Seq Box for s in
Fig. 2). Then, we map it to its corresponding Dafny type (seq<char> for s in
Fig. 2). The latter step may require choosing among the different types listed by
the model (e.g., between string and seq<char>). We give preference to the original
type names (seq<char>) to clearly separate user-defined from built-in types.
We also take special care to extract type parameters and reconstruct the Dafny
type name from its Boogie translation, for example, Module.Module2.Class
from Module_mModule2.Class.
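The lookup chain for the Dafny type of s in Fig. 2 can be sketched in Python as follows. Encoding the model as nested dictionaries and the helper names `dafny_type` and `render_type` are assumptions made for illustration; only the function names and IDs ($Is, Tag, TagSeq, TChar, Inv0_TSeq) come from Fig. 2, and the tool itself is not implemented this way.

```python
from typing import Optional

# Slice of the Z3 model from Fig. 2, encoded as plain dictionaries (assumed encoding).
FIG2_MODEL = {
    "s#0": "T@U!val!71",
    "TagSeq": "T@U!val!13",
    "TChar": "T@U!val!1",
    "$Is": {("T@U!val!71", "T@U!val!76"): True},
    "Tag": {"T@U!val!76": "T@U!val!13"},
    "Inv0_TSeq": {"T@U!val!76": "T@U!val!1"},
}


def render_type(model: dict, type_id: str) -> str:
    """Turn a Dafny type ID from the model into a Dafny type name."""
    if type_id == model.get("TChar"):
        return "char"
    tag = model["Tag"].get(type_id)
    if tag is not None and tag == model.get("TagSeq"):
        # Inv0_TSeq maps a sequence type to the type of its elements.
        return f"seq<{render_type(model, model['Inv0_TSeq'][type_id])}>"
    return "?"  # types outside this sketch


def dafny_type(model: dict, variable: str) -> Optional[str]:
    """Find a type ID t with $Is(value, t) == true and render it."""
    value = model[variable]
    for (element, type_id), holds in model["$Is"].items():
        if element == value and holds:
            return render_type(model, type_id)
    return None


print(dafny_type(FIG2_MODEL, "s#0"))
```

For the slice in Fig. 2, the chain s#0, ID 71, ID 76, tag 13 (seq), element type 1 (char) yields seq<char>.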

After determining the type of a variable, our tool extracts the string rep- 
resentation of the variable’s value. The way the value is specified in the coun- 
terexample model depends on the variable type. In method Match in Fig. 1, the 
receiver is an instance of a user-defined class Simple, so the tool looks up the 
value of its only field this.p. This field is itself a non-primitive variable, and so 
we recurse into its definition until we reach a value of a primitive type, which we 
then use to construct the non-primitive value. In case the model does not specify 
a value for some variable of primitive type, the tool automatically generates an 
adequate value that is different from any other value of that type in the model 
or source code. 
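The recursion into non-primitive values and the generation of fresh primitive values can be sketched as follows. This is a simplified Python illustration under several assumptions: values are modeled as nested Python objects, `None` stands for a primitive that the model leaves unspecified, nested values are inlined rather than referenced through names such as @1, and the function names are hypothetical.

```python
import string


def fresh_char(used: set) -> str:
    """Pick a character that does not yet occur in the model or the source code."""
    for candidate in string.ascii_uppercase:
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise ValueError("no unused character left")


def render(value, used: set) -> str:
    """Recursively build the string representation of a value."""
    if value is None:  # primitive left unspecified by the model
        return repr(fresh_char(used))
    if isinstance(value, str):  # a char fixed by the model
        return repr(value)
    if isinstance(value, list):  # sequence: recurse into the elements
        parts = [f"Length := {len(value)}"]
        parts += [f"[{i}] := {render(v, used)}" for i, v in enumerate(value)]
        return "(" + ", ".join(parts) + ")"
    if isinstance(value, dict):  # object: recurse into the fields
        fields = ", ".join(f"{name} := {render(v, used)}" for name, v in value.items())
        return "(" + fields + ")"
    raise TypeError(f"unsupported value: {value!r}")


used_chars = {"?"}  # '?' occurs in the source code of Fig. 1
print("this =", render({"p": ["?"]}, used_chars))
print("s =", render([None], used_chars))
```

The unspecified character of s is filled with 'A', the first candidate that differs from every character already in use.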

Our implementation of the counterexample extraction tool supports all ba- 
sic types, user-defined classes, datatypes, arrays, and the three most commonly 
used collections (sequences, sets, and maps). See Table 1 for concrete examples 
of the tool’s output. Previously, the counterexample generator could only show 
the values of integer and Boolean variables, constructor names used to create a 
datatype, or the length of a sequence. The differences between our new imple- 
mentation and past versions are mostly due to the support we added for new 
types and collections (e.g., chars, bit vectors, maps). However, we also had to 
revamp and bring up-to-date some of the previously implemented features that 
have since ceased to function as intended. For instance, Krucker and Schaden [9] 
show that they could once extract the values of an object’s fields, but this
functionality had not been maintained and it stopped working properly. We speculate
that the lack of automated testing likely contributed to the failure to adapt the 
counterexample extraction to the rapidly evolving Dafny infrastructure. To en- 
sure maintainability, we have developed an extensive test suite as part of this 
work. The test suite contains 54 tests covering all supported types and collec- 
tions, and is executed as part of the continuous integration process of Dafny. 

To benefit from the counterexample extraction feature while working in the
Visual Studio Code IDE, the user needs only to install the Dafny plugin. In addition
to visualizing counterexamples in the VS Code plugin, the counterexample
extraction tool provides a public API and can be imported as a dependency by 
any C# project. Finally, we made our accompanying artifact publicly available 
to improve the reproducibility of our contributions [4]. 


4 Conclusions and Future Work 


This paper presents the new, improved version of Dafny’s counterexample ex- 
traction tool, which now extracts values of all variables of basic or user-defined 
types as well as variables representing algebraic datatypes, arrays, sequences, 
sets, and maps. We integrated the tool into the Dafny plugin for Visual Studio 
Code, and released it with Dafny v3.3.0. The tool has already been used by 
Dafny developers to assist them during the proof debugging process. 

Note that a counterexample reported by the Dafny verifier might occasion- 
ally be a spurious one. This is a well-known problem that users of these veri- 
fiers struggle with. It is typically due to the incompleteness of the underlying 
SMT solver, for example, in the presence of quantifiers. A possible approach to
identifying spurious counterexamples is to generate a concrete test case from
the counterexample, execute the program concretely using the test case, and 
observe whether the concrete execution violates the same property [2,14]. The 
counterexample extraction tool presented in this paper, with its ability to extract 
the type and concrete value of any variable, can be used for test case generation 
as well. As future work, we plan to build on this functionality and implement 
extensions for identifying spurious counterexamples as well as for automatic unit 
test generation. 
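The check for spurious counterexamples described above can be sketched as a small harness. This Python illustration is an assumption about how such a check could look, not the planned implementation; `classify_counterexample` and the toy programs are hypothetical.

```python
from typing import Any, Callable, Dict


def classify_counterexample(program: Callable[..., Any],
                            postcondition: Callable[..., bool],
                            inputs: Dict[str, Any]) -> str:
    """Run the program on the counterexample inputs and compare with the property."""
    result = program(**inputs)
    return "spurious" if postcondition(result, **inputs) else "confirmed"


def postcondition(b: bool, s: str, p: str) -> bool:
    # Specification of Match from the running example.
    return b == all(s[n] == p[n] or p[n] == "?" for n in range(len(s)))


def buggy_match(s: str, p: str) -> bool:
    return s == p


def fixed_match(s: str, p: str) -> bool:
    return all(s[n] == p[n] or p[n] == "?" for n in range(len(s)))


counterexample = {"s": "A", "p": "?"}
print(classify_counterexample(buggy_match, postcondition, counterexample))
print(classify_counterexample(fixed_match, postcondition, counterexample))
```

A counterexample is confirmed only if the concrete run violates the same property; otherwise it is classified as spurious.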
Abstract. cvc5 is the latest SMT solver in the cooperating validity
checker series and builds on the successful code base of CVC4. This paper
serves as a comprehensive system description of cvc5’s architectural
design and highlights the major features and components introduced
since CVC4 1.8. We evaluate cvc5’s performance on all benchmarks in
SMT-LIB and provide a comparison against CVC4 and Z3.
ulo theories · cvc5


1 Introduction 


SMT solvers are widely recognized as crucial back-end reasoning engines for 
a variety of applications, including software and hardware verification [19, 52, 
60, 68, 82, 86], model checking [41, 42, 98], type checking, static analysis, secu- 
rity [10,62], automated test-case generation [40,135], synthesis [2,65], planning, 
scheduling, and optimization [127]. Notable SMT solvers include Bitwuzla [92], 
Boolector [98], CVC4 [21], MathSAT [46], OpenSMT2 [72], SMTInterpol [44], 
SMT-RAT [50], STP [61], veriT [35], Yices2 [55], and Z3 [90]. 

Among these, the family of cooperating validity checker (CVC) tools [21, 26, 
27,132] have played an important role, both in research and in practice [11, 48, 
70, 137, 138]. The most recent incarnation, CVC4, was a from-scratch rewrite of
CVC3, written with the aim of creating a flexible and performant architecture
that could last far into the future. The fact that CVC4 has integrated over a
decade’s worth of SMT research and development while becoming increasingly
robust and performance-competitive attests to the success of that endeavor.

[...] SRC, United Technologies Research Center, and Stanford University, including the
[...] the Agile Hardware Center (AHA), and the SystemX Alliance. More details can be
found at: https://cvc5.github.io/acknowledgements.html.

In this paper, we introduce cvc5, the next solver in the series. cvc5 is not
a rewrite of CVC4 and indeed builds on its successful code base and architecture.
Compared to other SMT solvers, cvc5 supports a diverse set of theories
(all standard SMT-LIB theories, and many non-standard theories) and features
beyond regular SMT solving such as higher-order reasoning and syntax-guided
synthesis (SyGuS) [3]. The name change⁶ rather acknowledges both a (mostly)
new team of developers and the significant evolution the tool has undergone
since CVC4 was described in a tool paper published in 2011 [21]. Moreover,
cvc5 comes with updated documentation, new and improved APIs, and more
user-friendly installation. Most importantly, it introduces several significant new
features. Like its predecessors, cvc5 is available under the 3-clause BSD open
source license and runs on all major platforms (Linux, macOS, and Windows).

We make the following contributions: 


- An in-depth description of the architectural design of cvc5 and how its pieces
  and modules work together.

- A comprehensive summary of all features that have been added to the solver
  since CVC4 was introduced in [21].

- A description of major features introduced since CVC4 1.8, the final version
  of CVC4, including:

    - a new C++ API, and new Python and Java APIs that build on top of it;
    - a new theory solver for the theory of fixed-size bit-vectors;
    - a new and extensive proof-production module;
    - a new procedure for non-linear arithmetic; and
    - a syntax-guided quantifier-instantiation procedure [96].

- Evidence, based on experimental evaluation and industrial use cases, that
  cvc5 is in fact both versatile and industrial-strength.


2 Architecture and Core Components 


cvc5 supports reasoning about quantifier-free and quantified formulas in a wide 
range of background theories and their combinations, including all theories stan- 
dardized in SMT-LIB [22]. It further natively supports several non-standard the- 
ories and theory extensions. These include, among others, separation logic, the 
theory of sequences, the theory of finite sets and relations, and the extension of 
the theory of reals with transcendental functions. 

In this section, we start with a brief overview of the core components of cvc5, 
and then discuss them in more detail in the following subsections. A high-level 
overview of the system architecture is given in Figure 1. 


6 Whereas the convention for previous solvers in the CVC family was to use capital let- 
ters, here we introduce a new convention of using lower-case letters (or alternatively 
small capitals, as in this paper, which we find to be more visually appealing). 
Fig. 1: High-level overview of cvc5’s system architecture.


The central engine of cvc5 is the SMT Solver module, which is based on
the CDCL(T) framework [99] and relies on a customized version of the MiniSat
propositional solver [57] at its core. The SMT Solver consists of several compo- 
nents: the Rewriter and the Preprocessor modules, which apply simplifications 
locally (at the term level) and globally (on the whole input formula), respec- 
tively; the Propositional Engine, which serves as a manager for the CDCL(T) 
SAT solver; and the Theory Engine, which manages theory combination and all 
theory-specific and quantified reasoning procedures. 

Besides standard satisfiability checking, cvc5 provides additional function- 
ality such as abduction, interpolation, syntax-guided synthesis (SyGuS) [3], and 
quantifier elimination. Each of these features is implemented as an additional 
solver built on top of the SMT Solver. The SyGuS Solver is the main entry point 
for synthesis queries, which encode SyGuS problems as (higher-order) satisfiabil- 
ity problems with both semantic and syntactic constraints [114]. The Quantifier 
Elimination Solver performs quantifier elimination based on tracking the quan- 
tifier instantiations of the SMT Solver [116]. The Abduction Solver and the 
Interpolation Solver are both SyGuS-based [110] and thus are built as layers on 
top of the SyGuS Solver. 

cvc5 provides a C++ API as the main interface, not just for external client 
software, but also for its own parser and for additional language bindings in Java 
and Python. cvc5 also provides a textual command-line interface (CLI), built on 
top of the parser, which supports SMT-LIBv2 [25], SyGuS2 [104] and TPTP [134] 
as input languages. The Proof Module can output formal unsatisfiability proofs 
in three proof formats: Alethe [128], Lean 4 [88], and LFSC [133]. 


2.1 The SMT Solver Module 


The SMT Solver module is the centerpiece of cvc5 and is responsible for handling
all SMT queries. Its functionality includes, in addition to satisfiability
checking, constructing models for satisfiable input formulas and extracting
assumptions, cores, and proof objects for unsatisfiable formulas. The main
components of the SMT Solver module are described below.


Preprocessor. Before any satisfiability check, cvc5 applies to each formula 
from an input problem a sequence of satisfiability-preserving transformations. 
We distinguish between (i) required normalization passes, e.g., removal of ite 
terms; (ii) optional simplification passes aimed at making the formula easier to 
solve, e.g., finding entailed theory literals; and (iii) optional reduction passes that 
transform the formula from one logic to another, e.g., from non-linear integer 
arithmetic to a bit-vector problem with configurable bit-width. Currently, cvc5 
implements 34 passes, executed in a fixed order. Optional passes can be enabled 
and disabled via configuration options. Preprocessing passes are self-contained, 
and adding or modifying passes does not require knowledge of the internals of 
the SMT solver engine. 
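The organization of passes described above can be sketched as a fixed-order pipeline in which required passes always run and optional passes run only when enabled. The Python sketch below is a hypothetical illustration: the `Pass` class, the pass names, and the string encoding of assertions are assumptions and are not taken from cvc5.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

Assertions = List[str]


@dataclass(frozen=True)
class Pass:
    name: str
    required: bool
    apply: Callable[[Assertions], Assertions]


def run_pipeline(passes: List[Pass], assertions: Assertions, enabled: Set[str]) -> Assertions:
    """Apply the passes in their fixed order, skipping disabled optional ones."""
    for p in passes:
        if p.required or p.name in enabled:
            assertions = p.apply(assertions)
    return assertions


# Two toy satisfiability-preserving passes (hypothetical names).
PASSES = [
    Pass("drop-true", True, lambda a: [f for f in a if f != "true"]),
    Pass("dedupe", False, lambda a: list(dict.fromkeys(a))),
]

print(run_pipeline(PASSES, ["true", "p", "p"], enabled=set()))
print(run_pipeline(PASSES, ["true", "p", "p"], enabled={"dedupe"}))
```

Because each pass only maps a list of assertions to a new list, a pass can be added or changed without touching the rest of the pipeline.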


Propositional Engine. The Propositional Engine serves as the core CDCL(T) 
engine [99], which takes the Boolean abstraction of the input formula (together 
with any lemmas produced during solving) and produces a satisfying assignment 
for that abstraction. Its main components are the Clausifier and the propositional 
satisfiability (SAT) solver. The Clausifier converts the Boolean abstraction into 
Conjunctive Normal Form (CNF), which then serves as input for the SAT solver. 
In cvc5, as in CVC4, we use a customized version of MiniSat [57] as the core 
SAT solver. Extensions we have added to MiniSat include: the production of 
resolution proofs; native support for pushing and popping assertions; and a De- 
cision Engine [12], which can be used to create customized decision heuristics 
for MiniSat. 
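A common way to implement such a conversion is the Tseitin transformation, which introduces one propositional variable per sub-formula and yields an equisatisfiable CNF of linear size. The Python sketch below assumes this technique and a tuple-based formula encoding for illustration; the text does not state which encoding the Clausifier in cvc5 uses.

```python
from itertools import count
from typing import Dict, List, Tuple, Union

Formula = Union[str, tuple]  # "a" or ("not", f) / ("and", f, g, ...) / ("or", f, g, ...)


def tseitin(formula: Formula) -> Tuple[List[List[int]], Dict[Formula, int]]:
    """Return DIMACS-style clauses and the variable assigned to each sub-formula."""
    ids: Dict[Formula, int] = {}
    fresh = count(1)
    clauses: List[List[int]] = []

    def var(f: Formula) -> int:
        if f in ids:
            return ids[f]
        v = ids[f] = next(fresh)
        if isinstance(f, str):
            return v
        op, *args = f
        lits = [var(a) for a in args]
        if op == "not":
            (a,) = lits
            clauses.extend([[-v, -a], [v, a]])  # v <-> not a
        elif op == "and":
            clauses.extend([[-v, l] for l in lits])  # v -> each conjunct
            clauses.append([v] + [-l for l in lits])  # all conjuncts -> v
        elif op == "or":
            clauses.extend([[v, -l] for l in lits])  # each disjunct -> v
            clauses.append([-v] + lits)  # v -> some disjunct
        else:
            raise ValueError(f"unknown operator: {op}")
        return v

    clauses.append([var(formula)])  # assert the whole formula
    return clauses, ids


cnf, variables = tseitin(("and", ("or", "a", "b"), ("not", "a")))
print(cnf)
print(variables)
```

Sub-formulas are cached in `ids`, so a sub-formula that occurs several times is clausified only once.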

During its search, the Propositional Engine asserts a theory literal (¬)p to
the Theory Engine as soon as the SAT solver assigns a truth value to the propositional
variable abstracting the atom p. We refer to the set of all such literals
as the currently asserted literals. When checking the consistency of the set L
of currently asserted literals in the overall background theory T, we distinguish
between two levels of effort: standard and full, depending on whether the SAT
solver has a partial or full model, respectively, for the Boolean abstraction. At
standard effort, a theory solver may optionally perform some lightweight consistency
checking. At full effort, the theory solver must either produce a lemma
(following the splitting-on-demand approach [23]) or determine whether L is satisfiable
or not and, in the latter case, produce a conflict clause, a clause that is
valid in the theory T but is inconsistent with L.


Rewriter. The Rewriter module is responsible for converting terms via a set 
of rewrite rules into semantically equivalent normal forms. In contrast to pre- 
processing, rewriting is done during solving. In fact, all major components of 
cvc5 invoke the Rewriter to ensure that the terms they work with are normal- 
ized, thereby simplifying their implementation. Rewrite rules are applied locally, 
i.e., independent of the currently asserted literals, and are divided into required 
and optional rules, of which the latter can be enabled or disabled by the user. 
The Rewriter maintains a cache to avoid processing any term more than once. 
Examples of rewrites include simplifications such as x + 0 ⇝ x, normalizations
that sort the operands of associative and commutative operators, and operator
eliminations such as x ≤ y ⇝ y + 1 > x (when x and y have integer sort). In
certain contexts, e.g., enumerative SyGuS approaches, aggressive rewriting rules, 
which would be detrimental to SMT solving, can be beneficial. Such rules are 
implemented in an Extended Rewriter, which is enabled when needed. 
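A cached, bottom-up rewriter of this kind can be sketched in a few lines of Python. The tuple-based term encoding, the choice of rules (removing zero summands and sorting the operands of + and *), and the ordering by `repr` are assumptions made for illustration and do not mirror the rule set of cvc5.

```python
from typing import Dict, Union

Term = Union[int, str, tuple]  # 3, "x", or ("+", t1, t2, ...)

_cache: Dict[Term, Term] = {}


def rewrite(term: Term) -> Term:
    """Rewrite a term to a normal form, independent of any asserted literals."""
    if not isinstance(term, tuple):
        return term
    if term in _cache:  # never process the same term twice
        return _cache[term]
    op, *args = term
    args = [rewrite(a) for a in args]  # normalize sub-terms first
    if op == "+":
        args = sorted((a for a in args if a != 0), key=repr)  # x + 0 ~> x, then sort
        if not args:
            result: Term = 0
        elif len(args) == 1:
            result = args[0]
        else:
            result = (op, *args)
    elif op == "*":
        result = (op, *sorted(args, key=repr))  # sort commutative operands
    else:
        result = (op, *args)
    _cache[term] = result
    return result


print(rewrite(("+", "y", ("+", "x", 0))))
print(rewrite(("+", "x", "y")))
```

Both calls yield the same normal form, so later components can compare the two terms syntactically.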

To help automate improvements to the Rewriter, we developed a work- 
flow that detects and enumerates new rewrite rule candidates using the SyGuS 
solver [101]. It works by detecting and suggesting critical pairs, i.e., pairs of 
equivalent terms that are not rewritten to the same term by the current rules. 


Theory Engine. The Theory Engine is the main entry point for checking the 
theory consistency of the theory literals asserted by the Propositional Engine. It 
dispatches each of these literals to the appropriate theory solvers and is further 
responsible for dispatching any propagated literals or lemmas generated by the 
theory solvers back to the Propositional Engine. 

When multiple theory solvers are enabled, the Combination Engine sub-module
is responsible for coordinating between them. Like CVC4, cvc5 uses
the polite theory combination mechanism [74, 108, 130]. This includes propagat- 
ing or performing case splits on equalities and disequalities between shared terms 
(terms appearing in the literals of more than one theory solver). As in CVC4, 
the algorithm for computing these splits is based on care graphs [75]. 

The Combination Engine controls the Model Manager, which is responsible
for combining models from multiple theories and constructing a model for the input
formula. The Model Manager also maintains an equivalence relation E over all
the terms in the input formula, induced by all of the currently asserted literals 
that are equalities. When invoked, the Model Manager has the responsibility 
of assigning concrete values to each equivalence class of E with the assistance 
of the individual theory solvers, which provide values for terms in their theory. 
Typically, the Model Manager is invoked only when the theory solvers have 
reached a saturation point that allows the Theory Engine to conclude that the 
input problem is satisfiable (and thus, a model can be constructed successfully). 
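The value assignment per equivalence class can be sketched as follows. This Python illustration assumes a union-find representation of the equivalence relation and integer values; the function `build_model` and the `suggested` dictionary, which stands in for values provided by theory solvers, are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_model(terms: List[str],
                equalities: Iterable[Tuple[str, str]],
                suggested: Dict[str, int]) -> Dict[str, int]:
    """Assign one value per equivalence class induced by the asserted equalities."""
    parent = {t: t for t in terms}

    def find(t: str) -> str:
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in equalities:
        parent[find(a)] = find(b)

    classes: Dict[str, List[str]] = {}
    for t in terms:
        classes.setdefault(find(t), []).append(t)

    model: Dict[str, int] = {}
    used = set(suggested.values())
    next_fresh = 0
    for members in classes.values():
        # Prefer a value suggested by a theory solver for some member of the class.
        value = next((suggested[t] for t in members if t in suggested), None)
        if value is None:
            while next_fresh in used:  # otherwise pick a value not used elsewhere
                next_fresh += 1
            value = next_fresh
            used.add(value)
        for t in members:
            model[t] = value
    return model


print(build_model(["x", "y", "z", "w"], [("x", "y")], {"y": 5}))
```

Terms in the same class share a value, while classes without a suggested value receive distinct fresh values.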

As in CVC4, each sub-formula of the input that starts with a quantifier is 
abstracted by a propositional variable. When any such variable or its negation 
is asserted, the Theory Engine dispatches the corresponding quantified formula 
to the Quantifiers Module, which generates suitable quantifier instantiations. 
Since certain techniques for handling quantified formulas, e.g., E-matching [89], 
require knowledge of the state and terms known by the other theory solvers, this 
module has access to all equality information from all theory solvers. 


Theory Solvers. cvc5 supports a wide range of theories, including all theo- 
ries standardized in SMT-LIB. Each theory solver relies on an Equality Engine 
Module, which implements congruence closure over a configurable set of oper- 
ators, typically those that belong to the solver’s theory. The Equality Engine 
is responsible for quickly detecting conflicts due to equality reasoning. In addi- 
tion, all theories communicate reasoning steps to the rest of the system via the 
Theory Inference Manager. Every theory solver emits lemmas, conflict clauses, 
and propagated literals through this interface. The Theory Inference Manager
implements or simplifies common usage patterns such as caching and rewriting
lemmas, proof construction, and collection of statistics. Every lemma or conflict sent
from a theory is associated with a unique identifier for its kind, the inference 
identifier, which is a crucial debugging aid. Below, we briefly survey the theory 
solvers in cvc5, along with their main implementation techniques. 
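The conflict detection performed by congruence closure can be illustrated with a deliberately naive Python sketch. It merges classes from the asserted equalities, propagates congruence (equal arguments imply equal applications) until a fixpoint, and then checks the disequalities. The term encoding and the function `find_conflict` are assumptions; a production engine uses far more efficient, incremental data structures.

```python
from typing import Iterable, Optional, Tuple, Union

Term = Union[str, tuple]  # "a" or ("f", arg1, ...)


def find_conflict(equalities: Iterable[Tuple[Term, Term]],
                  disequalities: Iterable[Tuple[Term, Term]]) -> Optional[Tuple[Term, Term]]:
    """Return a violated disequality, or None if equality reasoning finds no conflict."""
    equalities, disequalities = list(equalities), list(disequalities)
    terms = set()

    def collect(t: Term) -> None:
        terms.add(t)
        if isinstance(t, tuple):
            for arg in t[1:]:
                collect(arg)

    for a, b in equalities + disequalities:
        collect(a)
        collect(b)

    parent = {t: t for t in terms}

    def find(t: Term) -> Term:
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in equalities:
        parent[find(a)] = find(b)

    apps = [t for t in terms if isinstance(t, tuple)]
    changed = True
    while changed:  # naive quadratic fixpoint over function applications
        changed = False
        for i, s in enumerate(apps):
            for t in apps[i + 1:]:
                if (find(s) != find(t) and s[0] == t[0] and len(s) == len(t)
                        and all(find(x) == find(y) for x, y in zip(s[1:], t[1:]))):
                    parent[find(s)] = find(t)
                    changed = True

    for a, b in disequalities:
        if find(a) == find(b):
            return (a, b)
    return None


print(find_conflict([("a", "b")], [(("f", "a"), ("f", "b"))]))
```

Here a = b forces f(a) = f(b), so the asserted disequality between f(a) and f(b) is reported as the conflict.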


Linear Arithmetic. The linear arithmetic solver [78] extends the simplex pro- 
cedure adapted for SMT by Dutertre and de Moura [56]. It implements a sum- 
of-infeasibilities-based heuristic [79], an integration with the external GLPK LP 
solver [80], and certain heuristics proposed by Griggio [63]. Integer problems are 
handled by solving their real relaxation before using branching [64] and cutting 
planes [54] to find integer solutions. The branch-and-bound method optionally 
generates lemmas consisting of ternary clauses inspired by unit-cube tests [39]. 


Non-linear Arithmetic. For non-linear arithmetic problems, cvc5 resorts to 
linear abstraction and refinement. It uses a combination of independent sub- 
solvers integrated with the linear arithmetic solver and invoked only when the 
linear abstraction is satisfiable. One sub-solver implements cylindrical algebraic 
coverings [1], while the other sub-solvers are based on incremental lineariza- 
tion [45]. A variety of lemma schemas are used to assert properties of non-linear 
functions (e.g., multiplication and trigonometric functions) in a counterexample- 
guided fashion [123]. Non-linear integer problems are solved by incremental lin- 
earization and incomplete techniques based on reductions to bit-vectors. 


Arrays. As in CVC4, the array solver is based on a decision procedure by
de Moura and Bjørner [91] but following the more detailed description by
Jovanović and Barrett [75]. An alternative experimental implementation based on
an approach by Christ and Hoenicke [43] is also available. 


Bit-Vectors. For the theory of fixed-size bit-vectors, cvc5’s main approach
is bit-blasting, which refers to the process of translating bit-vector problems into
equisatisfiable SAT problems, and is applied after preprocessing. In cvc5, we
distinguish two modes for bit-blasting: lazy and eager. Lazy bit-blasting seamlessly
integrates with the CDCL(T) infrastructure of cvc5 and fully supports
the combination of bit-vectors with any theory supported by cvc5. It further
leverages the full power of cvc5’s Equality Engine for reasoning about equalities
over bit-vector terms and also uses the solve-under-assumptions feature [57]
supported by many state-of-the-art SAT solvers. For problems that can be fully
reduced to bit-vectors, cvc5 can also be used in eager mode. This mode does
not rely on solving under assumptions, but instead directly asserts all of the 
bit-blasted constraints to the SAT solver, which usually enables more simplifica- 
tions. Additionally, cvc5 supports the Ackermannization and eager bit-blasting 
of constraints involving uninterpreted functions and sorts [66]. 
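
As a rough illustration of eager bit-blasting, the sketch below translates a 4-bit addition constraint into CNF via a ripple-carry adder and hands all clauses to a toy DPLL procedure. Everything here is hypothetical: the names BitBlaster, dpll, and decode are invented for the example, and the tiny solver merely stands in for a real SAT back-end.

```python
class BitBlaster:
    """Translates fixed-width bit-vector addition into CNF (Tseitin style)."""

    def __init__(self):
        self.num_vars = 0
        self.clauses = []  # each clause is a list of non-zero ints (DIMACS style)

    def fresh(self):
        self.num_vars += 1
        return self.num_vars

    def bitvec(self, width):
        """A fresh bit-vector, least significant bit first."""
        return [self.fresh() for _ in range(width)]

    def _gate(self, clauses_for):
        out = self.fresh()
        self.clauses += clauses_for(out)
        return out

    def xor(self, a, b):
        return self._gate(lambda o: [[-a, -b, -o], [a, b, -o], [a, -b, o], [-a, b, o]])

    def and_(self, a, b):
        return self._gate(lambda o: [[-a, -b, o], [a, -o], [b, -o]])

    def or_(self, a, b):
        return self._gate(lambda o: [[a, b, -o], [-a, o], [-b, o]])

    def add(self, xs, ys):
        """Ripple-carry adder modulo 2**width; returns the sum bits."""
        out, carry = [], None
        for a, b in zip(xs, ys):
            half_sum = self.xor(a, b)
            half_carry = self.and_(a, b)
            if carry is None:
                out.append(half_sum)
                carry = half_carry
            else:
                out.append(self.xor(half_sum, carry))
                carry = self.or_(half_carry, self.and_(half_sum, carry))
        return out

    def assert_const(self, bits, value):
        """Fix a bit-vector to a constant using unit clauses."""
        for i, var in enumerate(bits):
            self.clauses.append([var if (value >> i) & 1 else -var])


def dpll(clauses, assignment=None):
    """Toy DPLL: returns a satisfying assignment (dict) or None if unsat."""
    assignment = dict(assignment or {})
    changed = True
    while changed:  # unit propagation
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    unassigned.append(lit)
                elif val == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return None  # every literal is false: conflict
            if len(unassigned) == 1:
                assignment[abs(unassigned[0])] = unassigned[0] > 0
                changed = True
    for clause in clauses:  # branch on the first unassigned variable
        for lit in clause:
            if abs(lit) not in assignment:
                for value in (True, False):
                    result = dpll(clauses, {**assignment, abs(lit): value})
                    if result is not None:
                        return result
                return None
    return assignment


def decode(bits, model):
    return sum(1 << i for i, var in enumerate(bits) if model.get(var, False))


# Usage: solve x + y = 9 with x = 5 over 4-bit vectors.
bb = BitBlaster()
x, y = bb.bitvec(4), bb.bitvec(4)
bb.assert_const(bb.add(x, y), 9)
bb.assert_const(x, 5)
model = dpll(bb.clauses)
y_value = decode(y, model)
```

In this sketch every constraint is asserted up front, which loosely mirrors the eager mode described above; a lazy integration would instead pass bit-blasted literals as assumptions and interleave with other theory solvers.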


Datatypes. For quantifier-free constraints over datatypes, we use a rule-based 
procedure that follows calculi already implemented in CVC4 [24,112] and that 
optimizes the sharing of selectors over multiple constructors [125].


Floating-Point Arithmetic. Formulas in the theory of floating-point arithmetic
are translated to equisatisfiable formulas in the theory of bit-vectors, in a
process referred to as word-blasting. For this, cvc5 integrates the SymFPU [37]
library, which was first used in CVC4 and has also been integrated in the Bitwuzla
SMT solver [92]. This approach admits several optimizations compared to
earlier solvers, which translate directly to the bit-level, e.g., CNF or AIGs. An- 
other difference from older approaches [38] is that translation is done at the for- 
mula level instead of the term level. Conversions between real and floating-point 
terms are treated as uninterpreted functions and refined if the models of the real 
arithmetic and the floating-point solver do not agree. The refinement lemmas 
use the monotonicity of the conversion functions to constrain the floating-point 
and real arithmetic terms to matching intervals that exclude the current model. 


Sets and Relations. cvc5 implements a solver for the parametric theory of
finite sets, i.e., sets whose elements are of any sort supported by cvc5. The 
core decision procedure for sets is extended with support for cardinality con- 
straints [13]. The set theory solver is extended with a sub-module that specializes 
in relational constraints [87], where relations are modeled as sets of tuples. 


Separation Logic. In separation logic, the semantics of constraints assume 
a location and data type for specifying the model of the heap. cvc5 supports
an extension of the SMT-LIB language for separation logic [73], in which the 
location and data types of the heap can be any sort supported by cvc5. The 
classical separation logic connectives are treated as theory predicates which are 
lazily reduced to constraints over sets and uninterpreted functions [115]. 


Strings and Sequences. For strings and sequences, cvc5 implements a solver
consisting of multiple layered components. At its core, the solver reasons about 
length constraints and word equations [84], supplemented with reasoning about 
code points to handle conversions between strings and integers efficiently [119]. 
Extended functions such as string replacement are lazily reduced to word equa- 
tions after context-dependent simplifications [126]. When necessary, the regular 
expressions in input problems are unfolded and derivatives are computed [85]. 
The string theory solver further incorporates aggressive simplification rules that 
rely on abstractions to derive facts about string terms [118]. Finally, conflicts 
are detected eagerly on partial assignments from the SAT solver by computing 
the congruence closure and constant prefixes and suffixes of string terms. 


Uninterpreted Functions. The theory of uninterpreted functions is handled 
in largely the same way as in CVC4. It follows Simplify’s approach [53] ex- 
tended with support for fixed finite cardinality constraints [121]. This extension 
is used in combination with finite-model-finding techniques for finding finite 
models based on minimal interpretations of uninterpreted sorts. 


Quantifiers. Quantified formulas are all handled by the Quantifiers Mod- 
ule, which resembles a theory solver. The module contains many sub-solvers, all 
based on some form of quantifier instantiation, and each specializing in solving 
specific classes of quantified formulas. The Quantifiers Module relies on heuristic
E-matching when uninterpreted functions are present [89]. This technique
is supplemented by conflict-based instantiation for detecting when an instantiation
is in conflict with the currently asserted literals [16,124]. The Quantifiers
Module additionally incorporates finite-model-finding techniques, which are useful
for detecting satisfiable input problems [122]. It also relies on enumerative
approaches when other techniques are incomplete [109]. For quantifiers over linear
arithmetic, it uses a specialized counterexample-guided approach for
quantifier instantiation [116]. An extension of this technique is used for quantified
bit-vector logics [95]. For other quantified logics in pure background theories,
e.g., over floating-point or non-linear arithmetic, cvc5 relies on syntax-guided
quantifier instantiation [96]. The Quantifiers Module also contains sub-solvers 
implementing more advanced solving paradigms, including: a module for doing 
Skolemization with inductive strengthening and enumeration of sub-goals for 
inductive theorem proving problems [117], a finite-model-finding technique for 
recursive functions [113], and a solver for syntax-guided synthesis [114]. 


2.2 Proof Module 


The Proof Module of cvc5 was built from scratch and replaces the proof system
of CVC4 [67,77], which was incomplete and suffered from a number of architectural
shortcomings. The design of cvc5’s Proof Module was guided by the
following principles. First, the overhead incurred by proof production should be 
at most linear in the solving time. Second, the emitted proofs should be de- 
tailed enough to enable efficient (i.e., polynomial) checking, ensuring that proof 
checking is inherently simpler than solving. Third, disabling a system compo- 
nent when in proof production mode because it lacks adequate proof generation 
capabilities should be done rarely and only if the component is not crucial for 
performance. Finally, given the different needs of users and the trade-offs offered 
by different proof systems, proof production should be flexible enough to allow 
the emission of proofs in different formats. 

Following these design principles, the Proof Module in cvc5 produces detailed
proofs for nearly all of its theories, rewrite rules, preprocessing passes,
internal SAT solvers, and theory combination engines. It further supports eager 
and lazy proof production with built-in proof reconstruction. This enables proof 
production for some notoriously challenging functionalities, such as substitution 
and rewriting (common, for example, in simplification under global assumptions 
and in string solving [126]). Furthermore, although it maintains internally a 
single proof representation, cvc5 is able to emit proofs in multiple formats, including
those supported by the LFSC [133] proof checker and the Lean 4 [88],
Isabelle/HOL [100] and Coq [30] proof assistants. 


2.3 Node Manager


Formulas and terms are represented uniformly in cvc5 as nodes in a directed
acyclic graph, reference-counted and managed by the Node Manager. The Node
Manager further maintains a Skolem Manager, which is responsible for tracking
Skolem symbols introduced during solving. All cvc5 instances in the same thread
share the same Node Manager instance.

Nodes are immutable and are aggressively shared using hash consing: when- 
ever a new node is about to be created, the Node Manager checks whether a 
node with the same structure already exists, and if it does, it returns a reference 
to the existing node instead. Besides saving memory, this ensures that syntactic 
equality checks can be performed in constant time (by comparing the unique ids 
assigned to each node). Reference counting allows the Node Manager to deter- 
mine when to dispose of nodes. Weak references are used whenever possible to 
limit the overhead of reference counting. 
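
The hash-consing scheme described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not cvc5's C++ implementation: the names Node, NodeManager, mk_node, and size are hypothetical, and reference counting is omitted because Python manages memory itself.

```python
class Node:
    """A DAG node, treated as immutable once created."""
    __slots__ = ("id", "kind", "name", "children")

    def __init__(self, node_id, kind, name, children):
        self.id = node_id          # unique id, used for constant-time equality
        self.kind = kind           # operator kind (e.g., "MULT") or leaf kind
        self.name = name           # payload for leaves, None for operators
        self.children = children   # tuple of child nodes


class NodeManager:
    """Creates nodes and shares structurally identical ones (hash consing)."""

    def __init__(self):
        self._table = {}
        self._next_id = 0

    def mk_node(self, kind, *children, name=None):
        # Children are already hash-consed, so their ids identify them.
        key = (kind, name, tuple(child.id for child in children))
        node = self._table.get(key)
        if node is None:
            node = Node(self._next_id, kind, name, tuple(children))
            self._next_id += 1
            self._table[key] = node
        return node

    def size(self):
        return len(self._table)


# Usage: building the same term twice yields the very same object.
nm = NodeManager()
x = nm.mk_node("VARIABLE", name="x")
two = nm.mk_node("CONST_INTEGER", name="2")
first = nm.mk_node("MULT", x, two)
second = nm.mk_node("MULT", nm.mk_node("VARIABLE", name="x"), two)
same_object = first is second
```

Because structurally equal terms share one object, comparing ids (or object identity) suffices for syntactic equality, which is the constant-time check mentioned in the text.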

Nodes store 96 bits of metadata (id, reference count, kind, and number of 
children) and a variable number of pointers to child nodes. The kind of a node 
can be an operator kind, e.g., addition, or a leaf kind, e.g., a variable. Optional 
additional static information associated with nodes can be stored separately in 
hash maps referred to as node attributes. Since node attributes are managed by 
the Node Manager, which may be shared by multiple solver instances, attributes 
must only be used to capture inherent node properties (i.e., properties that are 
independent of run-time options). 

Many theory solvers, including those for quantifiers, strings, arrays, non- 
linear arithmetic, and sets, introduce terms with Skolem (i.e., fresh) constants 
during solving. Such constants are centrally generated by the Skolem Manager, 
which also associates with each of them a term of the same sort, the constant’s 
witness form. If the computed witness form for a constant matches that of a 
previously used constant, the previous constant can be reused. This not only 
provides a deterministic way of generating fresh constants during solving but 
also allows the system to minimize the number of introduced constants. This 
reuse is crucial for performance in some theory solvers [120]. 


2.4 Context-Dependent Data Structures 


Certain applications of SMT solvers require multiple satisfiability checks with 
similar assertions. To support such applications, the SMT-LIB standard includes 
commands to save (with a push command) the current set of user-level assertions 
and restore (with a pop command) a previous set. This allows the solver to reuse 
parts of the work from earlier satisfiability checks and amortizes startup cost. 
Most of the state of cvc5 depends directly or indirectly on the current set of
assertions. So whenever the user pushes or pops, cvc5 has to save or restore
the corresponding state. Similarly, whenever the SAT solver makes a decision or 
backtracks to a previous decision point, each theory solver has to save or restore 
the corresponding information. 

    (a) The base cvc5 Python API

        s = Solver()
        i = s.getIntegerSort()
        x = s.mkConst(i, "x")
        s.assertFormula(
            s.mkTerm(kinds.Equal,
                     s.mkTerm(kinds.Mult,
                              x, s.mkInteger(2)),
                     s.mkInteger(4)))
        s.checkSat()

    (b) The “pythonic” API

        solve(2 * Int("x") == 4)

Fig. 2: Example of using the Python APIs of cvc5.


To support these operations, cvc5 defines a notion of context level, which
increases with each push and decreases with each pop operation, and implements
context-dependent data structures. These data structures behave similarly
to corresponding mutable data structures provided in the C++ standard library,
except that they are associated with a context level and automatically save and
restore their state as the context increases or decreases. For efficiency reasons,
this state data is stored using a region-based custom allocator that allocates one
region per context level, allowing all state data associated with a level to be
freed simultaneously by simply freeing the corresponding region.
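
The behavior of a context-dependent data structure can be illustrated with a small Python sketch of an append-only list. The names Context and CDList are hypothetical, and the sketch stores only a size per level instead of using a region-based allocator.

```python
class Context:
    """Tracks the current context level and notifies registered objects."""

    def __init__(self):
        self.level = 0
        self._objects = []

    def register(self, obj):
        self._objects.append(obj)

    def push(self):
        self.level += 1
        for obj in self._objects:
            obj.save()

    def pop(self):
        if self.level == 0:
            raise RuntimeError("pop without matching push")
        for obj in self._objects:
            obj.restore()
        self.level -= 1


class CDList:
    """Append-only list whose contents are rolled back on Context.pop()."""

    def __init__(self, context):
        self._items = []
        # One saved size per enclosing level; a list created at level n is
        # emptied again when the context drops below level n.
        self._saved_sizes = [0] * context.level
        context.register(self)

    def append(self, item):
        self._items.append(item)

    def save(self):
        self._saved_sizes.append(len(self._items))

    def restore(self):
        del self._items[self._saved_sizes.pop():]

    def items(self):
        return list(self._items)


# Usage: assertions added after a push disappear on the matching pop.
ctx = Context()
assertions = CDList(ctx)
assertions.append("x > 0")
ctx.push()
assertions.append("x < 5")
inside = assertions.items()
ctx.pop()
outside = assertions.items()
```

Saving only the list length is enough here because the list is append-only; structures that allow in-place updates would need to record the overwritten values as well.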


3 Highlighted Features 


In this section, we discuss features that are new in cvc5 as well as some of the
more prominent user- and developer-facing features. We compare them to their 
counterparts in CVC4 when applicable. 


Application Programming Interfaces (APIs). cvc5 provides a lean, com- 
prehensive, and feature-complete C++ API, which also serves as the main inter- 
face for the parser module and the basis for all other language bindings. The 
parser module uses the same API as external users, without any special privileges.
cvc5’s C++ API has been designed and written from scratch and thus
is not backwards compatible with CVC4’s C++ API. It is centered around the
Solver class, which represents a cvc5 instance and implements methods for
tasks such as creating terms, asserting formulas, and issuing checks. 

cvc5’s Python API is built on top of cvc5’s C++ API using Cython [29] and 
makes all of cvc5’s features accessible to Python users. It is a straightforward
translation of the C++ API without added syntactic sugar such as operator overloading.
In addition, cvc5 provides a higher-level layer on top of its
Python API, which is more user-friendly and pythonic. This layer provides automatic
solver management, allows SMT terms to be constructed using Python
infix operators, and converts Python objects to SMT terms of the appropriate
sort. This leads to much more succinct code, as shown in Figure 2, which compares
using the high- and low-level Python APIs to solve the integer equation
$2 \cdot x = 4$. The higher-level Python API is based on and designed to work as a
drop-in replacement for Z3py, the Python API of Z3 [90]. 

cvc5’s Java API is implemented via the Java Native Interface (JNI), which 
allows Java applications to invoke native code and vice versa [83]. In contrast, 
CVC4 uses SWIG [28] to semi-automatically generate bindings. One of the challenges
of developing a Java API, and the main motivation for implementing it
manually instead of using SWIG, is the interaction between Java’s garbage collector
and cvc5’s reference-counting mechanism for terms and sorts. The new
API implements the AutoCloseable interface to destroy the underlying C++ objects
in the expected order. It mostly mirrors the C++ API and supports operator
overloading, iterators, and exceptions. There are a few differences from the C++ 
API, such as using arbitrary-precision integer pairs, specifically, pairs of Java 
BigInteger objects, to represent rational numbers. In contrast to the old Java 
API, the new API puts greater emphasis on using Java-native types such as 
List<T> instead of wrapper classes for C++ types such as std::vector<T>.


Documentation. We provide comprehensive documentation for both cvc5 
users [8] and developers [6]. User documentation contains instructions for building
and installing cvc5 and its dependencies, extensive documentation and examples
of common use cases for all available APIs, and a thorough description
of all supported non-standard theories with examples. Developer documentation
provides details of cvc5 internals and instructions for contributions, including
guidelines for coding and testing, and a recommended development workflow. 


Proofs. As mentioned above, cvc5 has a new proof system. Proofs are stored
internally using a new custom intermediate representation. Multiple output proof 
formats are supported via target-specific post-processing transformations on 
this internal representation. The final proof object can then be pretty-printed 
and saved in a text file. The currently supported output proof formats include 
LFSC [133], Alethe [128], and the language of the Lean 4 [88] proof assistant. 

CVC4 proofs exclusively used the LFSC format. cvc5 continues support for
LFSC but with a new, more user-friendly syntax. LFSC is a logical framework, 
based on Edinburgh LF [69], which was explicitly designed to facilitate the pro- 
duction and checking of fine-grained proofs in SMT. It comes with a small and 
high-performance proof checker, which is generic in the sense that it takes as 
input both a proof term p and a proof signature, a definition of the data types 
and proof rules used to construct p. The checker verifies that p is well-formed 
with respect to the provided signature. We have defined proof signatures for all 
the individual theories supported by cvc5. These definitions can be combined
together as needed to define a proof system for any combination of those theo- 
ries. When emitting proofs in LFSC, cvc5 includes all the relevant signatures 
as a preamble to the proof term. 

The Alethe proof format is a flexible proof format for SMT solvers based on 
SMT-LIB. It includes both coarse- and fine-grained steps and was first imple- 
mented in the veriT solver [34]. Alethe proofs can be checked via reconstruction 
within Isabelle/HOL [15,129] as well as within Coq, the latter via the SMTCoq 
plugin [5,58]. Our main motivation for producing Alethe proofs is to leverage 
these proof reconstruction infrastructures, thus enabling the trustworthy inte- 
gration of cvc5 in Isabelle/HOL and Coq. Users of these tools can leverage the 
integration to dispatch selected goals to cvc5 for proving, thereby increasing the
level of automation available to them without requiring a larger trusted core.
These integrations represent ongoing work in cvc5 and are being carried out in
close collaboration with both Isabelle/HOL and Coq experts.

Although we aim to have a similar full integration in the Lean 4 [88] proof
assistant in the future, cvc5 currently only supports the use of Lean 4 as an
external checker; i.e., cvc5 can emit proofs as Lean terms (for a subset of the
theories supported by cvc5), and Lean 4 can then check these proofs. Since the
underlying logic of Lean 4 is an extension of that of LFSC, this functionality
follows an approach similar to that used for LFSC by modeling cvc5 proof rules
as Lean types and reducing proof checking to type checking. 


Syntax-Guided Synthesis. cvc5 has native support for syntax-guided synthesis
(SyGuS) problems [3]. As mentioned, the cvc5 core has a dedicated module
for encoding SyGuS problems into (higher-order) SMT formulas, annotated
with syntactic restrictions. These restrictions are represented via a deep embed- 
ding into the theory of datatypes. Internally, after encoding the SyGuS problem, 
a sub-module of the quantifiers theory, called the synthesis engine, is the main 
entry point for solving. Based on the shape of the input, it uses one of three 
approaches. If the input problem has no syntactic restrictions, and is in single 
invocation form [114], that is, all functions to synthesize are applied to the same 
argument list, then it uses a quantifier-instantiation based approach. Otherwise, 
it uses one of two enumerative approaches, depending on the properties of the 
input [111]. The SyGuS solver also implements further refinements and exten- 
sions of the enumerative approaches, including algorithms for decision-tree learn- 
ing [4] for programming-by-example problems, extended rewriting for enumera- 
tion [101], piecewise-independent unification [17], and static grammar-reduction 
techniques. Furthermore, the SyGuS solver contains specialized procedures to 
support an efficient implementation of interpolation and abduction. 


Interpolation and Abduction. cvc5 computes abducts and Craig inter- 
polants [51] using solvers built on top of the SyGuS solver. The solver for in- 
terpolation translates an interpolation query into a SyGuS conjecture whose 
solutions are interpolants. Specifically, given quantifier-free formulas A and C
over any combination of the theories supported by cvc5, the interpolation solver
solves for B in the SyGuS conjecture $(A \rightarrow B) \wedge (B \rightarrow C)$, with the syntactic
restriction that B’s free symbols range over the symbols shared by A and C. Any
synthesized solution for B is, by construction, a Craig interpolant for A and C.
Abduction is the process of constructing a formula B that is enough to add
to a formula A to prove some goal formula C (equivalently, to make the formula
$F = A \wedge B \wedge \neg C$ unsatisfiable). cvc5’s abduction solver reduces this problem to a
SyGuS one where B is the formula to be synthesized and F is the semantic constraint.
Optionally, the user can also impose syntactic restrictions on the abduct
B. The SyGuS solver implements specific optimizations for abduction queries, 
such as using unsat cores to prune classes of invalid candidate solutions [110]. 
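
The two semantic conditions above can be checked by brute force in the propositional case. The following Python sketch does so by truth-table enumeration; it only verifies a given candidate B and does not synthesize one, and all names (is_interpolant, is_abduct, the example formulas) are hypothetical.

```python
from itertools import product


def assignments(variables):
    """Enumerate all truth assignments over the given variable names."""
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))


def is_valid(formula, variables):
    return all(formula(m) for m in assignments(variables))


def is_interpolant(a, b, c, variables, a_syms, b_syms, c_syms):
    """Check (A -> B) and (B -> C), with B restricted to shared symbols."""
    shared_only = set(b_syms) <= (set(a_syms) & set(c_syms))
    a_implies_b = is_valid(lambda m: (not a(m)) or b(m), variables)
    b_implies_c = is_valid(lambda m: (not b(m)) or c(m), variables)
    return shared_only and a_implies_b and b_implies_c


def is_abduct(a, b, c, variables):
    """Check that A and B and not C is unsatisfiable."""
    return not any(a(m) and b(m) and not c(m) for m in assignments(variables))


VARIABLES = ["p", "q", "r"]

# Interpolation example: A = p and q, C = q or r, candidate B = q.
interp_ok = is_interpolant(
    a=lambda m: m["p"] and m["q"],
    b=lambda m: m["q"],
    c=lambda m: m["q"] or m["r"],
    variables=VARIABLES,
    a_syms={"p", "q"}, b_syms={"q"}, c_syms={"q", "r"},
)

# Abduction example: A = (p -> q), goal C = q, candidate B = p.
abduct_ok = is_abduct(
    a=lambda m: (not m["p"]) or m["q"],
    b=lambda m: m["p"],
    c=lambda m: m["q"],
    variables=VARIABLES,
)
```

In the SMT setting the validity and unsatisfiability checks would be discharged by the solver itself, and the candidates would come from SyGuS enumeration instead of being supplied by hand.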


Non-Linear Arithmetic. The new sub-solver for non-linear arithmetic is 
based on cylindrical algebraic coverings and closely follows [1], with some notable 
extensions. The implementation uses the libpoly library [76], which provides 
polynomial arithmetic and most algebraic routines required for the computation 
of cylindrical algebraic decompositions and coverings. Infeasible subsets are computed
by tracking all contributing assertions for every covering. The infeasible
subset is then obtained from the union of assertions from the top-level covering.
The sub-solver implements several different variable orderings, as these can have 
a significant impact on run-times in practice. Apart from classical variable order- 
ings used for cylindrical algebraic decomposition, some experimental orderings 
based on machine learning have been implemented, roughly following ideas from 
England et al. [59]. (Mixed real-) integer problems are supported by dynamically 
injecting intervals into coverings to cover gaps that do not contain integers. 


Higher-Order Logic. cvc5 has been extended with partial support for higher-order
logic [18]. The extension is based on a pragmatic approach in which λ-abstractions
are eliminated eagerly via lambda lifting [71]. This approach is used
with the theory solver for the quantifier-free fragment of the theory of equality 
with uninterpreted functions (EUF) and with the quantifier-instantiation tech- 
nique based on E-matching with triggers [53,89]. For the EUF solver, we added 
support for (dis)equality constraints between functions, via an extensionality 
inference rule, and for partial applications of (Curried) functions. For quan- 
tifier instantiation, we modified several of the data structures for E-matching 
to incorporate matching in the presence of equalities between function values, 
function variables, and partial function applications. The extension also uses 
custom axioms, such as an axiom simulating how functions are updated, to improve
the generation of new λ-abstractions, since cvc5 does not yet perform
HO-unification, which would allow it to synthesize arbitrary λ-abstractions.


New Bit-Vector Solver. cvc5 features a new bit-blasting solver, which supports
the use of off-the-shelf SAT solvers such as CaDiCaL [31] or CryptoMiniSat [131]
as SAT back-ends for both the eager and lazy bit-blasting approaches.
In contrast, CVC4’s lazy bit-blasting solver relied on a customized version of 
MiniSat and did not allow the use of more recent state-of-the-art SAT solvers. 


Int-Blasting. In addition to bit-blasting, cvc5 implements int-blasting techniques,
which reduce bit-vector problems to equisatisfiable non-linear integer
arithmetic problems [97,138]. These techniques are orthogonal to bit-blasting 
and especially effective on unsatisfiable formulas over large bit-widths. 
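
To give a flavor of int-blasting, the sketch below pairs a few bit-vector operations with integer-arithmetic counterparts and checks exhaustively, for width 4, that they agree with bit-level reference semantics. The encodings shown are textbook ones and are an assumption of this example; the text does not specify the exact translations used by cvc5.

```python
def to_bits(n, k):
    """Little-endian list of the k low bits of n."""
    return [(n >> i) & 1 for i in range(k)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


# Bit-level reference semantics.
def bvadd_bits(a, b, k):
    out, carry = [], 0
    for x, y in zip(to_bits(a, k), to_bits(b, k)):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return from_bits(out)


def bvmul_bits(a, b, k):
    acc = 0
    for i, bit in enumerate(to_bits(b, k)):
        if bit:  # shift-and-add, truncated to k bits
            acc = bvadd_bits(acc, from_bits(to_bits(a << i, k)), k)
    return acc


def bvand_bits(a, b, k):
    return from_bits([x & y for x, y in zip(to_bits(a, k), to_bits(b, k))])


# Integer encodings over values in the range 0 <= v < 2**k.
def int_bvadd(a, b, k):
    return (a + b) % 2**k


def int_bvmul(a, b, k):
    return (a * b) % 2**k  # the product of two variables is non-linear


def int_bvand(a, b, k):
    # Sum over bit positions of the product of the extracted bits.
    return sum(2**i * ((a // 2**i) % 2) * ((b // 2**i) % 2) for i in range(k))


def encodings_agree(k):
    values = range(2**k)
    return all(
        int_bvadd(a, b, k) == bvadd_bits(a, b, k)
        and int_bvmul(a, b, k) == bvmul_bits(a, b, k)
        and int_bvand(a, b, k) == bvand_bits(a, b, k)
        for a in values
        for b in values
    )


agree_width_4 = encodings_agree(4)
```

Note that the integer side never enumerates individual bits for addition or multiplication, which is one intuition for why such encodings can scale better than bit-blasting on large bit-widths.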


Syntax-Guided Quantifier Instantiation. cvc5 features a new theory-agnostic
enumerative quantifier-instantiation technique we call syntax-guided quantifier
instantiation [96]. This technique leverages cvc5’s SyGuS solver to synthesize
terms for quantifier instantiation in a counterexample-guided manner.


Unsatisfiable Cores. In cvc5, unsat (short for unsatisfiable) core extraction
has been completely overhauled. It now uses the new proof infrastructure for
tracking preprocessing transformations, which, unlike CVC4’s, supports
most of the preprocessing passes. Unsat cores can be extracted based on
the constructed proof or via the tracked preprocessing and assumption-based unsat
core extraction [47]. For the latter, cvc5 uses the solve-under-assumptions
feature available in the MiniSat-based SAT engine. This is a lightweight solution
that does not require the generation of proofs in the SAT solver and full preprocessing
proofs. However, if a user requests both unsat cores and proofs, cvc5
switches to proof-based unsat core extraction using the new proof infrastructure.


Distributed and Central Policies for Equality Reasoning. As mentioned 
in Section 2, the Combination Engine manages theory combination, and theory 
solvers manage their interactions with the rest of the system via their Equality 
Engine. In contrast to CVC4, the policy for assigning an Equality Engine to a
theory solver in cvc5 is configurable. In the distributed policy, a new Equality
Engine is generated and assigned for each theory solver. These theory solvers 
perform congruence closure and their theory-specific reasoning locally. The ad- 
vantage of this approach is that the constraints are local to the theory and thus 
do not lead to overhead when combined with other theories. In the central policy, 
a single, shared Equality Engine is assigned to all theory solvers. The advantage 
of this approach is that communication of facts between theory solvers happens 
automatically, which in turn can trigger theory propagations more eagerly. Both 
policies use the same core Equality Engine Module. Each theory solver has been 
refactored to be agnostic with respect to the equality policy. 


Decision Heuristic. For Boolean reasoning, in addition to MiniSat’s decision 
heuristic, cvc5 implements a separate decision heuristic that uses the original
Boolean structure of the input to keep track of the justified parts of the input 
constraints, i.e., the parts where it can infer the value of terms based on a 
(partial) assignment to sub-terms. To make decisions, this new heuristic traverses 
assertions not satisfied by the currently asserted literals, computing the desired 
values (starting with true as the desired value for the root) for each term until it 
finds an unasserted literal that would contribute towards a desired value. This 
heuristic is a reimplementation and extension of a heuristic [12] implemented 
in CVC4. The heuristic optionally prioritizes assertions that most frequently 
contributed to conflicts in the past using a dynamic ordering scheme. 
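
A much-simplified version of such a justification-style traversal is sketched below in Python. It is an assumption-laden illustration, not the heuristic implemented in cvc5: formulas are nested tuples over "and", "or", and "not", atoms are strings, and the functions value and next_decision are hypothetical names.

```python
def value(formula, assignment):
    """Three-valued evaluation: True, False, or None if not yet justified."""
    if isinstance(formula, str):
        return assignment.get(formula)
    op, *args = formula
    vals = [value(arg, assignment) for arg in args]
    if op == "not":
        return None if vals[0] is None else not vals[0]
    if op == "and":
        if any(v is False for v in vals):
            return False
        return True if all(v is True for v in vals) else None
    if op == "or":
        if any(v is True for v in vals):
            return True
        return False if all(v is False for v in vals) else None
    raise ValueError(f"unknown operator: {op}")


def next_decision(formula, desired, assignment):
    """Return (atom, value) contributing to the desired value, or None.

    None means the formula already has a value under the partial
    assignment, so no decision is needed for it.
    """
    if value(formula, assignment) is not None:
        return None
    if isinstance(formula, str):
        return (formula, desired)
    op, *args = formula
    if op == "not":
        return next_decision(args[0], not desired, assignment)
    # For "and"/"or", descend into the first child whose value is unknown;
    # that child can always be pushed towards the desired value.
    for arg in args:
        decision = next_decision(arg, desired, assignment)
        if decision is not None:
            return decision
    return None


# Usage: the root assertion is desired to be true.
assertion = ("and", ("or", "p", "q"), ("not", "r"))
first = next_decision(assertion, True, {})
second = next_decision(assertion, True, {"p": True})
done = next_decision(assertion, True, {"p": True, "r": False})
```

Once p is true, the disjunction is justified and q is never decided on, which is the point of tracking justified parts of the input. The dynamic prioritization of assertions mentioned above is not modeled.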


Additional Features. Many more aspects and features have been improved 
and implemented with the goal of providing useful information to users and de- 
velopers. Notable examples include: a complete overhaul of CVC4’s mechanism 
for collecting statistics; improved bookkeeping for information about theory lem- 
mas; and a general mechanism for communicating additional information to users 
such as quantifier instantiations and terms enumerated by the SyGuS solver. 


4 Evaluation 


We evaluate cvc5’s overall performance (commit 5f998504) by comparing it
against Z3 4.8.12 [90] and CVC4 1.8 (see footnote 7). Z3 is a widely used, high-performance
SMT solver which, like cvc5, supports a wide range of theories. We compare
against CVC4 to illustrate some of the performance improvements implemented 
as part of the move to cvc5. To run CVC4 optimally, we use the same command- 
line options as those in CVC4’s competition script for SMT-COMP 2020 [9]. 
Similarly, for cvc5, we use a (slightly updated) version of the competition script
from SMT-COMP 2021 [7]. For some logics, e.g., quantified logics, these scripts 
try multiple options in a sequential portfolio. 


7 The artifact of this evaluation is archived in the Zenodo open-access repository [14].


Division                                cvc5    CVC4      Z3

Arith (7104)                            6593    6498    6844
Bitvec (6045)                           5741    5690    5664
Equality (12159)                        6677    6681    4688
Equality+LinearArith (55948)           49395   48487   49503
Equality+MachineArith (4712)            2065    1832    1804
Equality+NonLinearArith (17260)        11088   10906    9341
FPArith (3170)                          2625    2113    2593
QF_Bitvec (42450)                      41569   41448   40582
QF_Equality (16254)                    16124   16121   16115
QF_Equality+Bitvec (16518)             16274   16333   16318
QF_Equality+LinearArith (3924)          3778    3782    3822
QF_Equality+NonLinearArith (673)         598     610     616
QF_FPArith (76084)                     75998   75965   75816
QF_LinearIntArith (9765)                8619    8778    8464
QF_LinearRealArith (2008)               1849    1881    1864
QF_NonLinearIntArith (24261)           17525   16860   18357
QF_NonLinearRealArith (11552)          10889    9207   10354
QF_Strings (69863)                     69231   69367   68074
Total (379750)                        346638  342559  340819


Table 1: Benchmarks solved by cvc5, CVC4, and Z3 with a 20 minute time limit.


We ran all experiments on a cluster equipped with Intel Xeon E5-2620 v4 
CPUs. We allocated one CPU core and 8GB of RAM for each solver and bench- 
mark pair and ran each benchmark with a 20 minute time limit, the same 
time limit used at SMT-COMP 2021 [102]. We used all non-incremental SMT- 
LIB [22] benchmarks for our evaluation, with the exception of 45 (misclassified) 
benchmarks that have quantifiers in quantifier-free logics and 1128 (misclassi- 
fied) benchmarks that have non-linear literals in linear arithmetic logics. These 
are known misclassifications in the current release of SMT-LIB. Note that many 
benchmarks in SMT-LIB come from industrial applications. 

Table 1 shows the number of solved benchmarks for each solver using the 
same divisions as those used for SMT-COMP 2021. There were no disagreements 
among the solvers on the satisfiability of benchmarks. Overall, cvc5 solves the
largest number of benchmarks. Compared to CVC4, cvc5 solves fewer benchmarks
in the quantifier-free linear integer arithmetic division due to refactorings
related to adding proof support. In the quantifier-free equality and bit-vector
division, cvc5 also solves fewer benchmarks, which we attribute to the fact that
the new bit-vector solver has not yet been optimized for theory combination. 
Finally, for quantifier-free string benchmarks, there have been bug fixes since 
CVC4 that affected performance. 
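
The totals in Table 1 can be cross-checked with a few lines of Python. The figures below are copied from the table; the script itself and its names are only an illustrative aid, with division names written in the SMT-COMP style.

```python
# (benchmarks in division, solved by cvc5, solved by CVC4, solved by Z3)
TABLE_1 = {
    "Arith": (7104, 6593, 6498, 6844),
    "Bitvec": (6045, 5741, 5690, 5664),
    "Equality": (12159, 6677, 6681, 4688),
    "Equality+LinearArith": (55948, 49395, 48487, 49503),
    "Equality+MachineArith": (4712, 2065, 1832, 1804),
    "Equality+NonLinearArith": (17260, 11088, 10906, 9341),
    "FPArith": (3170, 2625, 2113, 2593),
    "QF_Bitvec": (42450, 41569, 41448, 40582),
    "QF_Equality": (16254, 16124, 16121, 16115),
    "QF_Equality+Bitvec": (16518, 16274, 16333, 16318),
    "QF_Equality+LinearArith": (3924, 3778, 3782, 3822),
    "QF_Equality+NonLinearArith": (673, 598, 610, 616),
    "QF_FPArith": (76084, 75998, 75965, 75816),
    "QF_LinearIntArith": (9765, 8619, 8778, 8464),
    "QF_LinearRealArith": (2008, 1849, 1881, 1864),
    "QF_NonLinearIntArith": (24261, 17525, 16860, 18357),
    "QF_NonLinearRealArith": (11552, 10889, 9207, 10354),
    "QF_Strings": (69863, 69231, 69367, 68074),
}

SOLVERS = ("cvc5", "CVC4", "Z3")

# Column sums: total benchmarks followed by the per-solver totals.
totals = tuple(sum(row[i] for row in TABLE_1.values()) for i in range(4))

# Percentage of all benchmarks solved by each solver.
solved_pct = {
    solver: 100.0 * totals[i + 1] / totals[0] for i, solver in enumerate(SOLVERS)
}

# Divisions in which cvc5 solves fewer benchmarks than CVC4.
cvc5_behind_cvc4 = sorted(name for name, row in TABLE_1.items() if row[1] < row[2])

for solver in SOLVERS:
    print(f"{solver}: {solved_pct[solver]:.1f}% of {totals[0]} benchmarks")
print("cvc5 behind CVC4 in:", ", ".join(cvc5_behind_cvc4))
```

The column sums reproduce the Total row of the table, and the list of divisions where cvc5 trails CVC4 includes the three divisions singled out in the discussion above.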

In addition to regularly participating in SMT-COMP, cvc5 and CVC4 also 
participate in the CADE ATP System Competition (CASC) and in SyGuS- 
Comp [103]. In CASC, cvc5 tends to perform in the middle of the pack on 
untyped theorem divisions (unsatisfiable quantified UF in SMT-LIB parlance), 
and towards the top of the pack on theorems with arithmetic. The last time 
SyGuS-Comp was held was in 2019, when CVC4 won four out of five tracks.

CVC4 is used extensively in industry, and our users are in the process of
updating to cvc5. Examples of its use include: a back-end for Zelkova, a
tool developed at Amazon to reason about AWS Access Policies [10,11,33]; a
back-end for Boogie [20], which is used in many projects including Dafny [81] 
and the Move Prover [137], a tool used to formally verify smart contracts; a 
back-end at Certora, another company engaged in formal verification of smart 
contracts [138]; a back-end for Sledgehammer [32], a tool for discharging proof 
obligations in Isabelle used by Isabelle’s own industrial users; and a back-end 
for SPARK [70], a development environment for safety-critical Ada programs. 


5 Future Work 


We briefly highlight a few current development directions for cvc5. 

Optimization Solver. Optimization modulo theories (OMT) [136] is an exten- 
sion of SMT, which requires a solver not only to determine satisfiability but also 
to return a satisfying assignment (if any) that optimizes one or more objectives. 
OMT is already supported by several solvers including MathSAT [46] and Z3. 
cvc5 already has internal infrastructure for supporting OMT queries. We aim 
to improve and expose (through the APIs) this capability in the near future. 

Theory of Bags. cvc5 has preliminary support for a theory of multisets (or 
bags) that can be implemented via a reduction to linear integer arithmetic [107]. 
We plan to extend this theory with higher-order combinators such as map and 
fold. With these combinators, and encoding relational tables as bags of tuples, 
cvc5 will be able to support several commonly-used table operations, with the 
goal of facilitating reasoning about SQL queries and database applications. 

Floating-Point Arithmetic. In addition to word-blasting, we plan to leverage 
our work on invertibility conditions [36] to lift the local search approach for 
bit-vectors from [93,94] to floating-point arithmetic. 

Internal Portfolio. Due to the computational complexity of SMT, there is 
often no single strategy that works best for all problems. As a result, users of 
SMT solvers often rely on portfolio approaches to try different sets of options, 
either in parallel or sequentially, as we did in Section 4. Implementing portfolio 
approaches that use the solver as a black box is sub-optimal because some work, 
such as parsing, has to be duplicated. The cvc5 roadmap includes plans to 
support portfolio solving internally, thereby avoiding that additional overhead. 
We further plan to provide predefined portfolios tuned for specific use cases. As 
one example of the different needs of different use cases, some applications prefer 
the solver to always return quickly (even if the answer is “unknown”) whereas 
others expect the solver to try as hard as possible to solve a given problem. 

New Parser. cvc5’s current parser is inherited from CVC4 and is based on 
the ANTLR 3 parser generator [105]. In addition to relying on a now deprecated 
version of ANTLR, the parser is unacceptably slow on large inputs and provides 
no API for user applications to interact with. A new parser using Flex [106] and 
Bison [49] is in development. The new parser will also provide an API allowing 
users to parse whole files or individual terms. 

Abstract. When augmented with a Pseudo-Boolean (PB) solver, a Boolean satisfiability 
(SAT) solver can apply powerful reasoning methods to determine 
when a set of parity or cardinality constraints, extracted from the clauses of the 
input formula, has no solution. By converting the intermediate constraints gen- 
erated by the PB solver into ordered binary decision diagrams (BDDs), a proof- 
generating, BDD-based SAT solver can then produce a clausal proof that the input 
formula is unsatisfiable. Working together, the two solvers can generate proofs of 
unsatisfiability for problems that are intractable for other proof-generating SAT 
solvers. The PB solver can, at times, detect that the proof can exploit modular 
arithmetic to give smaller BDD representations and therefore shorter proofs. 


1 Introduction 


Like all complex software, modern satisfiability (SAT) solvers are prone to bugs. In 
seeking to maximize their performance, developers may attempt optimizations that are 
either unsound or incorrectly implemented. Requiring a solver to be formally verified 
is not feasible for current solvers. On the other hand, ensuring that each execution of 
the solver yields the correct result has become a standard requirement. For a satisfiable 
formula, the solver can generate a purported solution, and this can be checked directly. 
For an unsatisfiable formula, the solver can produce a proof of unsatisfiability in a 
logical framework that enables checking by an efficient and trusted proof checker. Proof 
generation is a vital capability when SAT solvers are used for formal correctness and 
security verification, and for mathematical theorem proving. 

Most high-performance, proof-generating SAT solvers are based on conflict-driven, 
clause-learning (CDCL) algorithms [42]. Although the methods used by earlier solvers 
were limited to steps that could be justified within a resolution framework [43, 52], 
modern solvers employ a variety of optimizations that require a more expressive proof 
framework, with the most common being Deletion Resolution Asymmetric Tautology 
(DRAT) [31,50]. Like resolution proofs, a DRAT proof is a clausal proof consisting of a 
sequence of clauses, each of which preserves the satisfiability of the preceding clauses. 
An unsatisfiability proof starts with the clauses of the input formula and ends with an 
empty clause, indicating logical falsehood. The fact that this clause can be derived from 
the original formula proves that the original formula cannot be satisfied. 

Even with the capabilities of the DRAT framework, some solvers employ reasoning 
techniques for which they cannot generate unsatisfiability proofs. A number of SAT 
solvers can extract parity constraints from the input clauses and solve these as linear 
equations over the integers modulo 2 [6, 30,37,47]. Some can also detect and reason 
about cardinality constraints [6]. However, all these programs revert to standard CDCL 
when proof generation is required. To overcome the proof-generating limitations of cur- 
rent solvers, some have suggested using more powerful proof frameworks, for example, 
based on pseudo-Boolean constraints [27] or Binary Decision Diagrams [5]. Staying 
with DRAT avoids the need to develop, certify, and deploy new proof systems, file 
formats, and checkers. 


Current CDCL solvers do not use the full power of the DRAT framework. In par- 
ticular, DRAT supports adding extension variables to a clausal proof, in the style of 
extended resolution [48]. These variables serve as abbreviations for formulas over ex- 
isting input and extension variables. Compared to standard resolution, allowing exten- 
sion variables can yield proofs that are exponentially more compact [19], and the same 
holds for the extension rule in DRAT. In general, however, CDCL solvers have been un- 
able to exploit this capability, with the exception that some of their preprocessing and 
inprocessing techniques [8, 34] require extension variables [39]. One solver attempted 
to introduce extension variables as it operated [3], but it achieved only modest success. 


In 2006, Biere, Jussila, and Sinz demonstrated that the underlying logic behind al- 
gorithms for constructing Reduced, Ordered Binary Decision Diagrams (BDDs) [10] 
can be encoded as steps in an extended resolution framework [35,46]. By introducing 
an extension variable for each BDD node generated, the logic for each recursive step of 
standard BDD operations can be expressed with a short sequence of proof steps. BDDs 
provide a systematic way to exploit the power of extension variables. The recently de- 
veloped solver PGBDD [11, 12] (for “proof-generating BDD”) builds on this work with 
a more general capability for existentially quantifying variables. It can generate unsat- 
isfiability proofs for several classic challenge problems for which the shortest possible 
standard resolution proofs are of exponential size. 


We show that BDDs can provide a bridge between pseudo-Boolean reasoning and 
clausal proofs. Pseudo-Boolean (PB) constraints have the form $\sum_{j=1,n} a_j x_j \;\#\; b$, where 
each variable $x_j$ can be assigned value 0 or 1, the coefficients $a_j$ and constant $b$ are 
integers, and the relation symbol $\#$ is either $=$, $\geq$, or $\equiv \bmod r$ for some modulus $r$. 
Both parity and cardinality constraints can be expressed as PB constraints. A PB solver 
can employ Gaussian elimination or Fourier-Motzkin elimination [21,51] to determine 
when a set of constraints is unsatisfiable. Our newly developed program PGPBS (for 
“proof-generating pseudo-Boolean solver”) augments PGBDD with a pseudo-Boolean 
solver, combining the power of PB reasoning with DRAT proof generation. 


To enable proof generation, the PB solver generates BDD representations of its in- 
termediate constraints and has proof-generating BDD operations construct proofs that 
each of these constraints is logically implied by previous constraints. When the PB 
solver reaches a constraint that cannot be satisfied, e.g., the equation 0 = 2, the constraint 
will be represented by the false BDD leaf $\bot$, which yields a proof step consisting 
of the empty clause. The resulting proof is checkable within the DRAT framework with- 
out any reference to pseudo-Boolean constraints or BDDs. Barnett and Biere [5] also 
proposed using BDDs when proving that the constraints generated by a PB solver were 
logically implied by their predecessors, but they proposed doing so in a separate proof 
framework rather than as the solver operates. 

As an optimization, the PB solver can automatically detect cases where the unsatis- 
fiability proof for an integer-constraint problem can use modular arithmetic. This leads 
to more compact BDD representations, and therefore shorter proofs. 

We demonstrate the power of PGPBS’s combination of BDDs and pseudo-Boolean 
reasoning by showing that it can achieve polynomial scaling on two classes of 
problems for which CDCL solvers have exponential performance. These include parity 
constraints involving exclusive-or operations [17,49] and cardinality constraints, in- 
cluding the mutilated chessboard [2] and pigeonhole problems [29]. Although PGBDD 
on its own can also achieve polynomial scaling for both classes of problems, incorporat- 
ing pseudo-Boolean reasoning makes the solver much more robust. It can handle wider 
variations in the problem definition, how the problem is encoded as clauses, and the 
BDD variable ordering. It also operates with greater automation, requiring no guidance 
or hints from the user. These capabilities eliminate major shortcomings of PGBDD. 


2 Pseudo-Boolean Constraints 


Let $x_j$, for $1 \leq j \leq n$, be a set of variables, each of which may be assigned value 
0 or 1, and $a_j$, for $1 \leq j \leq n$, be a set of integer coefficients. Constant $b$ is also an 
integer. A pseudo-Boolean constraint is of the form $\sum_{j=1,n} a_j x_j \;\#\; b$, with $\#$ defining 
the relation between the left-hand weighted sum and the right-hand constant. For an 
integer equation, $\#$ is $=$, i.e., the two sides must be equal. For an ordering constraint, 
$\#$ is $\geq$. For a modular equation, $\#$ is $\equiv \bmod r$, where $r$ is the chosen modulus. 

Three constraint types are of special importance for solving cardinality problems. 
An at-least-one (ALO) constraint is an ordering constraint with $a_j \in \{0, +1\}$ for all 
$j$, and $b = +1$. An at-most-one (AMO) constraint is an ordering constraint with 
$a_j \in \{-1, 0\}$ for all $j$, and $b = -1$. An exactly-one constraint is an integer equation with 
$a_j \in \{0, +1\}$ for all $j$ and $b = +1$. 
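The three relation types and the ALO, AMO, and exactly-one special cases can be checked by brute force on a small instance. The following is a minimal sketch, not code from PGPBS; the function names `holds` and `solutions` and the relation tags `'='`, `'>='`, and `'mod'` are illustrative assumptions.

```python
from itertools import product

def holds(coeffs, b, rel, assignment, r=None):
    """Evaluate sum_j a_j * x_j (rel) b for a 0/1 assignment.

    rel is '=' (integer equation), '>=' (ordering constraint),
    or 'mod' (modular equation with modulus r).
    """
    total = sum(a * x for a, x in zip(coeffs, assignment))
    if rel == '=':
        return total == b
    if rel == '>=':
        return total >= b
    if rel == 'mod':
        return (total - b) % r == 0
    raise ValueError(rel)

def solutions(coeffs, b, rel, r=None):
    """All 0/1 assignments satisfying the constraint (exponential; for tiny n only)."""
    return [xs for xs in product((0, 1), repeat=len(coeffs))
            if holds(coeffs, b, rel, xs, r)]

n = 3
alo = solutions([+1] * n, +1, '>=')         # at least one variable is 1
amo = solutions([-1] * n, -1, '>=')         # at most one variable is 1
exactly_one = solutions([+1] * n, +1, '=')  # exactly one variable is 1
odd_parity = solutions([+1] * n, 1, 'mod', r=2)
```

On this instance the exactly-one solutions are precisely the assignments that satisfy both the ALO and the AMO constraint.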


2.1 BDD Representations 


Many researchers have investigated the use of BDDs to represent pseudo-Boolean con- 
straints [1,24,33]. As examples, Figure 1 shows BDD representations of the three forms 
of constraints for $n = 10$ and $b = 0$, with $a_j = +1$ for odd values of $j$ and $-1$ for even 
values. The modular equation has $r = 3$. The BDDs for both the integer equation 
(A) and ordering constraint (B) have an increasing number of nodes at each level for 
the first $n/2$ levels, with a node at level $k$ for each possible value of the prefix sum 
$\sum_{j=1,k-1} a_j x_j$. As the level $k$ approaches $n$, however, the number of nodes at each 
level decreases. If a prefix sum becomes too extreme on the negative side, it becomes 
impossible for the remaining values to cause the sum to reach b = 0. For the integer 
equation, a similar phenomenon happens if a prefix sum becomes too extreme on the 
positive side. For an ordering constraint, a sufficiently positive prefix sum will guaran- 
tee that the total sum will be at least 0. For the modular sum (C), the number of nodes 
at any level cannot exceed r—one for each possible value of the prefix sum modulo r. 

(A) Integer equation    (B) Ordering constraint    (C) Modular equation 

Fig. 1. Example BDD representations of pseudo-Boolean equations and ordering constraints. 
Solid (respectively, dashed) lines indicate the branch when the variable is assigned 1 (resp., 0). 
The leaf representing the false Boolean constant $\bot$ and its incoming edges are omitted. 


Letting $a_{\max} = \max_{1 \leq j \leq n} |a_j|$, the BDD representation of an integer equation or 
ordering constraint will have at most $2\, a_{\max} \cdot n$ nodes at any level, while the representation 
of a modular equation will have at most $r$ nodes at any level. Although large 
values of $a_{\max}$ ($a_{\max} \gg n$) can cause the BDDs to be of exponential size [1,33], our 
use of them will assume that both $a_{\max}$ and $r$ are small constants. The BDD representations 
will then be $O(n^2)$ for integer equations and ordering constraints, and $O(n)$ for 
modular equations. These bounds are independent of the BDD variable ordering. 
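The per-level bounds can be illustrated by counting prefix-sum values rather than building BDDs. The sketch below counts, for each level, the distinct prefix sums from which the constraint can still be satisfied; each such value is a candidate node, so the counts are upper bounds on the width of the reduced BDD, not exact node counts. The function name `level_widths` and the relation tags are assumptions made for this illustration.

```python
def level_widths(coeffs, b, rel, r=None):
    """Upper bound on nodes per level: distinct feasible prefix sums.

    rel is '=' (integer equation), '>=' (ordering), or 'mod' (modulus r).
    """
    norm = (lambda s: s % r) if rel == 'mod' else (lambda s: s)

    def satisfied(total):
        if rel == '>=':
            return total >= b
        return total == norm(b)          # covers '=' and 'mod'

    n = len(coeffs)
    # reach[k]: prefix sums obtainable after assigning the first k variables
    reach = [{0}]
    for a in coeffs:
        reach.append({norm(s + d) for s in reach[-1] for d in (0, a)})
    # good[k]: reachable prefix sums that some completion turns into a solution
    good = [set() for _ in range(n + 1)]
    good[n] = {s for s in reach[n] if satisfied(s)}
    for k in range(n - 1, -1, -1):
        good[k] = {s for s in reach[k]
                   if s in good[k + 1] or norm(s + coeffs[k]) in good[k + 1]}
    return [len(good[k]) for k in range(n)]

# The setting of Figure 1: n = 10, b = 0, a_j = +1 for odd j and -1 for even j.
n, a_max, r = 10, 1, 3
coeffs = [+1 if j % 2 == 1 else -1 for j in range(1, n + 1)]
w_eq = level_widths(coeffs, 0, '=')
w_ord = level_widths(coeffs, 0, '>=')
w_mod = level_widths(coeffs, 0, 'mod', r=r)
```

The modular counts never exceed $r$, while the integer counts stay within $2\, a_{\max} \cdot n$, matching the bounds stated above.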

Most BDD operations are implemented via the Apply algorithm [10], recursively 
traversing a set of argument BDDs to either construct a new BDD or to test some prop- 
erty of existing ones. The BDDs representing pseudo-Boolean constraints are levelized: 
every branch from a node at level $j$ goes to a leaf node or to a node at level $j + 1$. We 
can therefore derive a bound on the maximum number of recursive steps to perform an 
operation on $k$ argument BDDs, assuming both $a_{\max}$ and $r$ are small constants. Due to 
the caching of intermediate results, the maximum number of steps at each level will be 
bounded by the product of the number of argument nodes at this level. The operation 
will therefore have worst-case complexity $O(n^{k+1})$ for integer equations and ordering 
constraints, while it will have complexity $O(k \cdot n)$ for modular equations. 


2.2 Solving Systems of Equations with Gaussian Elimination 


We use a formulation of Gaussian elimination that scales each derived equation, rather 
than dividing by the pivot value [4,44]. Performing the steps therefore requires only 
addition and multiplication. This allows maintaining integer coefficients and automati- 
cally detecting a minimum, possibly non-prime, modulus for equation solving. 

Consider a system of integer or modular equations $E$, where each equation $e_i \in E$ 
is of the form $\sum_{j=1,n} a_{i,j}\, x_j = b_i$. Applying one step of Gaussian elimination involves 
selecting a pivot, consisting of an equation $e_s \in E$ and a variable $x_t$ such that $a_{s,t} \neq 0$. 
Then an equation $e'_i$ is generated for each value of $i$: 

$$e'_i = \begin{cases} e_i, & a_{i,t} = 0 \\ -a_{i,t} \cdot e_s + a_{s,t} \cdot e_i, & a_{i,t} \neq 0 \end{cases} \qquad (1)$$

where operations $+$ and $\cdot$ denote addition and scalar multiplication of equations. Observe 
that $a'_{i,t} = 0$ for all equations $e'_i$. Letting $E \leftarrow \{e'_i \mid i \neq s\}$, this step has reduced 
both the number of equations in $E$ and the number of variables in the equations by one. 
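A division-free elimination step of this kind can be sketched in a few lines. The code below is an illustration only, not the PGPBS implementation: an equation is represented as a pair (coefficient list, constant), and the names `scale_add` and `gauss_step` are assumptions. The pivot is passed in explicitly rather than chosen by a heuristic.

```python
def scale_add(c1, e1, c2, e2, r=None):
    """Return c1*e1 + c2*e2 for equations given as (coefficients, constant).

    If a modulus r is given, all values are reduced to the range 0..r-1.
    """
    (a1, b1), (a2, b2) = e1, e2
    coeffs = [c1 * x + c2 * y for x, y in zip(a1, a2)]
    const = c1 * b1 + c2 * b2
    if r is not None:
        coeffs = [c % r for c in coeffs]
        const %= r
    return coeffs, const

def gauss_step(eqs, s, t, r=None):
    """One step of rule (1): pivot on equation index s and variable index t."""
    a_st = eqs[s][0][t]
    assert a_st != 0, "pivot coefficient must be non-zero"
    out = []
    for i, e in enumerate(eqs):
        if i == s:
            continue                      # the pivot equation is dropped
        a_it = e[0][t]
        out.append(e if a_it == 0 else scale_add(-a_it, eqs[s], a_st, e, r))
    return out

# x1 + x2 = 1, x2 + x3 = 1, x1 + x3 = 1
system = [([1, 1, 0], 1), ([0, 1, 1], 1), ([1, 0, 1], 1)]
over_integers = gauss_step(gauss_step(system, 0, 0), 0, 1)
modulo_two = gauss_step(gauss_step(system, 0, 0, r=2), 0, 1, r=2)
```

Over the integers the two steps leave the single equation $2 x_3 = 1$, whereas modulo 2 the same steps produce $0 = 1$, the unsolvable form on which the unsatisfiability proofs rely.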

Repeated applications of the elimination step will terminate when either 1) all equa- 
tions have been eliminated, or 2) an unsolvable equation is encountered. For case 1, 
the system has solutions, but these may, in general, assign values other than 0 and 1 
to the variables. (Importantly, parity constraints are represented by modular equations 
with r = 2. Their solutions will be 0-1 valued, and so a SAT solver can make use of 
them [30, 37].) For case 2, if some elimination step generates an equation of the form 
$0 = b$ with $b \neq 0$, then this equation has no solution in any case, and therefore neither 
did the original system. Our proofs of unsatisfiability rely on reaching this condition. 

For the modular case, all coefficients and the constants are kept within the range 0 to 
$r - 1$. For integer equations, the coefficients can grow exponentially in m. Fortunately, 
the cardinality problems we consider only require coefficient values $-1$, $0$, and $+1$. 

As we have seen, the BDD representations of modular equations have bounded 
width, making them both more compact and making the algorithms that operate on 
them more efficient than for integer equations. As we will see, the unsatisfiability proof 
generated by applying Gaussian elimination to a system of modular equations can be 
significantly more compact than for the same equations over integers. This gives rise to 
an optimization we call modulus auto-detection. The idea is to apply Gaussian elimi- 
nation to a set of integer equations, recording the dependencies between the equations 
generated, but without performing any proof generation. Once the solver reaches an 
equation of the form $0 = b$ where $b \neq 0$, it chooses the smallest $r \geq 2$ such that 
$b \bmod r \neq 0$. It then generates a proof, reinterpreting the Gaussian elimination steps 
using modulo-$r$ arithmetic. Since the only operations of (1) are multiplication and addition, 
the final equation will be $0 \equiv b \pmod{r}$, which has no solution. Here we can 
see that allowing $r$ to be composite is both valid and may be optimal. For example, the 
smallest choice for $b = 30$ would be $r = 4$, rather than the prime $r = 7$. Auto-detection 
can be applied whenever Gaussian elimination encounters an unsolvable equation. 
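The modulus choice itself is a short search. The sketch below shows only that selection step, under the assumption that the final constant $b$ of the unsolvable equation is already known; the function name `auto_modulus` is hypothetical, and the dependency recording and proof replay described above are not modeled.

```python
def auto_modulus(b):
    """Smallest r >= 2 with b mod r != 0, for an unsolvable equation 0 = b."""
    if b == 0:
        raise ValueError("0 = 0 is solvable; no modulus is needed")
    r = 2
    # Terminates because r = |b| + 1 never divides a non-zero b.
    while b % r == 0:
        r += 1
    return r

r_for_30 = auto_modulus(30)   # 2 and 3 divide 30, 4 does not
r_for_1 = auto_modulus(1)     # parity reasoning suffices
```

For $b = 30$ the search skips 2 and 3 and stops at the composite value 4, in line with the example in the text.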


2.3 Solving Systems of Ordering Constraints with Fourier-Motzkin Elimination 


Consider a set $C$, consisting of constraints $c_i$ of the form $\sum_{j=1,n} a_{i,j}\, x_j \geq b_i$. Applying 
one step of Fourier-Motzkin elimination [21,51] to this system involves identifying a 
pivot, consisting of a variable $x_t$ such that $a_{k,t} \neq 0$ for at least one value of $k$. The set is 
partitioned into three sets by assigning each constraint $c_i$ to $C^+$, $C^-$, or $C^0$, depending 
on whether coefficient $a_{i,t}$ is positive, negative, or zero, respectively. For each pair $i$ 
and $i'$ such that $c_i \in C^+$ and $c_{i'} \in C^-$, a new constraint $c_{i,i'}$ is generated as: 

$$c_{i,i'} = -a_{i',t} \cdot c_i + a_{i,t} \cdot c_{i'} \qquad (2)$$

[Figure 2 diagram; labeled components: CNF File, Constraint Extractor, Schedule, Augmented Solver PGPBS containing the BDD-based SAT Solver PGBDD and the Pseudo-Boolean Solver (exchanging Constraints and BDDs), and DRAT output.] 

Fig. 2. Overall Structure of PGPBS. It augments the BDD-based SAT solver PGBDD with inferences 
from a pseudo-Boolean constraint solver. The constraint extractor is a separate program. 


(Note that the multiplication is always by positive values, since $a_{i',t} < 0$.) Letting 
$C \leftarrow C^0 \cup \{c_{i,i'} \mid c_i \in C^+, c_{i'} \in C^-\}$, all of these constraints have coefficient 0 for 
variable $x_t$. Therefore this step has reduced the number of variables in the constraints 
by one, but it may have increased the number of constraints. 

As with Gaussian elimination, repeated application of the elimination step will ter- 
minate when either 1) all variables have been eliminated or 2) an unsolvable constraint 
is encountered. With case 1, the constraints can be satisfied, although possibly by as- 
signing values other than 0 or 1 to some of the variables. An unsolvable constraint (case 
2) is one where the sum of the positive coefficients is less than the constant term. If such 
a constraint is encountered, then the original system of constraints has no solution. 
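The elimination step (2) and the unsolvability test can be sketched as follows. This is an illustration under stated assumptions, not the PGPBS solver: a constraint is a pair (coefficient list, constant) read as "weighted sum is at least the constant", the names `fm_step` and `unsolvable` are hypothetical, and the pivot variable is given explicitly.

```python
def fm_step(constraints, t):
    """One Fourier-Motzkin step eliminating the variable with index t."""
    pos = [c for c in constraints if c[0][t] > 0]
    neg = [c for c in constraints if c[0][t] < 0]
    result = [c for c in constraints if c[0][t] == 0]
    for ap, bp in pos:
        for an, bn in neg:
            m_p, m_n = -an[t], ap[t]      # both multipliers are positive
            coeffs = [m_p * x + m_n * y for x, y in zip(ap, an)]
            result.append((coeffs, m_p * bp + m_n * bn))
    return result

def unsolvable(constraint):
    """True when even setting every positively weighted variable to 1 falls short."""
    coeffs, b = constraint
    return sum(a for a in coeffs if a > 0) < b

# Two pigeons, one hole: x1 >= 1, x2 >= 1 (ALO), and -x1 - x2 >= -1 (AMO).
pigeons = [([1, 0], 1), ([0, 1], 1), ([-1, -1], -1)]
after_x1 = fm_step(pigeons, 0)
after_x2 = fm_step(after_x1, 1)
refuted = any(unsolvable(c) for c in after_x2)
```

Eliminating both variables leaves the single constraint $0 \geq 1$, so the tiny pigeonhole instance is refuted.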

Fourier-Motzkin elimination would appear to be hopelessly inefficient. The number 
of constraints can grow exponentially as the elimination proceeds, and the coefficients 
can grow doubly exponentially. Fortunately, the cardinality problems we consider have 
the property that for any variable $x_t$, there is at most one constraint $c_i$ having 
$a_{i,t} = +1$, at most one constraint $c_{i'}$ having $a_{i',t} = -1$, and no other constraint with a non-zero 
coefficient at position $t$. This property is maintained by each elimination step, and so 
the number of constraints will decrease with each step, and the coefficients will be 
restricted to the values $-1$, $0$, and $+1$. 


3 Overall Operation 


Figure 2 illustrates the program structure. The pair of programs—extractor and solver— 
supports the standard flow for proof-generating SAT solvers, reading the input conjunc- 
tive normal form (CNF) formula expressed in the standard DIMACS format and gen- 
erating proofs in the standard DRAT format. No other guidance or hint is provided. 
The constraint extractor identifies pseudo-Boolean constraints encoded as clauses in 
the input file and generates a schedule indicating how clauses should be combined and 
quantified to derive BDD representations of the constraints. PGPBS augments the SAT 
solver PGBDD with a PB solver. PGBDD supplies the constraints to the PB solver, which 
applies either Gaussian elimination or Fourier-Motzkin elimination. The PB solver gen- 
erates BDD representations of the constraints it generates, and, since the BDD library 
generates proof steps while performing BDD operations, it can generate a proof that 
each new constraint is logically implied by previous constraints. When the PB solver en- 
counters an unsolvable constraint, an empty clause is generated, completing the proof. 

(A) Exclusive-Or/Nor    (B) Exactly-one, direct encoding    (C) At-most-one, Sinz encoding [45] 

Fig. 3. Examples of pseudo-Boolean constraints extracted from CNF representations. Schedules 
use a stack notation indicating clauses, conjunction and quantification operations, and constraints. 


3.1 Constraint Extraction 


The constraint extractor uses heuristic methods to identify how the input clauses 
match standard patterns for exclusive-or/nor, ALO, and AMO constraints. The heuris- 
tics are independent of any ordering of the clauses or variables, although they do de- 
pend on the polarities of the literals. The generated schedule indicates how to combine 
clauses and to quantify variables to give the different constraints. The schedule uses a 
stack notation, having the following commands: 


c i_1, ..., i_k              Generate and push the BDDs for the specified clauses. 
a m                          Pop the top m + 1 elements. Combine with m AND operations. Push the result. 
q v_1, ..., v_k              Quantify the top element by the specified variables. 
C b a_1.v_1, ..., a_k.v_k    Confirm that the top stack element implies the constraint. 


The different constraint types C are ‘=’ for integer equations, ‘=2’ for mod-2 equations, 
and ‘>=’ for integer orderings. Each constraint line lists the constant b and then indicates 
the non-zero terms as a combination of coefficient and variable, separated by ‘.’. 

Figure 3 provides a series of examples illustrating the operation of the extractor. A 
$k$-way exclusive-or or exclusive-nor (A) is encoded with $2^{k-1}$ clauses (here $k = 3$), 
listing all combinations of the negated variables having even (XOR) or odd (XNOR) 
parity. The schedule lists the clause numbers, forms their conjunction, and indicates a 
mod-2 equation. The constant b is 1 for exclusive-or and 0 for exclusive-nor. 
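The clause pattern for a parity constraint can be generated and checked exhaustively. The sketch below uses DIMACS-style signed integers for literals; the function names `parity_clauses` and `satisfies` are illustrative assumptions and are not taken from the extractor.

```python
from itertools import product

def parity_clauses(variables, b):
    """Clauses encoding x_1 + ... + x_k = b (mod 2); b = 1 is XOR, b = 0 is XNOR.

    A clause whose negated variables form the set N rules out exactly the
    assignment that sets the variables in N to 1 and all others to 0, so we
    keep the clauses whose number of negated literals has parity other than b.
    """
    clauses = []
    for negated in product((False, True), repeat=len(variables)):
        if sum(negated) % 2 != b:
            clauses.append([-v if neg else v
                            for v, neg in zip(variables, negated)])
    return clauses

def satisfies(clauses, assignment):
    """assignment maps each variable number to 0 or 1."""
    return all(any((assignment[abs(lit)] == 1) == (lit > 0) for lit in clause)
               for clause in clauses)

xor3 = parity_clauses([1, 2, 3], 1)
xnor3 = parity_clauses([1, 2, 3], 0)
```

For $k = 3$ each list contains $2^{k-1} = 4$ clauses, and the XOR clauses have an even number of negated literals, as described above.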

An exactly-one constraint (B) can be expressed as a combination of an ALO constraint 
and an AMO constraint. The extractor assumes that any clause with all literals 
having positive polarity encodes an ALO constraint. In this example, a $k$-way AMO 
constraint ($k = 4$) is encoded directly as a set of $k(k-1)/2$ binary clauses. 
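The direct encoding is easy to generate and verify for small $k$. This is a minimal sketch with hypothetical names (`exactly_one_direct`, `satisfies`), using DIMACS-style signed integers for literals.

```python
from itertools import combinations, product

def exactly_one_direct(variables):
    """Direct encoding: one all-positive ALO clause plus pairwise AMO clauses."""
    alo = [list(variables)]
    amo = [[-u, -v] for u, v in combinations(variables, 2)]  # k(k-1)/2 clauses
    return alo + amo

def satisfies(clauses, assignment):
    """assignment maps each variable number to 0 or 1."""
    return all(any((assignment[abs(lit)] == 1) == (lit > 0) for lit in clause)
               for clause in clauses)

variables = [1, 2, 3, 4]
clauses = exactly_one_direct(variables)
models = [bits for bits in product((0, 1), repeat=len(variables))
          if satisfies(clauses, dict(zip(variables, bits)))]
```

With $k = 4$ this yields one ALO clause and six binary clauses, and the only satisfying assignments are those with exactly one variable set to 1.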

An AMO constraint can also be encoded with auxiliary variables (C) in a variety 
of ways, including that devised by Sinz [45]. The extractor examines how variables 
occur in binary clauses. Those that occur only with negative polarity are assumed to be 
constraint variables, while those that have mixed polarity are assumed to be auxiliary 
variables. As is shown, the generated schedule for an AMO constraint encoded with 
auxiliary variables employs early quantification [13] to linearize the conjuncting of 
clauses and the quantification of auxiliary variables. 
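The polarity heuristic can be tried on a concrete encoding. The sketch below assumes the commonly cited sequential-counter form of Sinz's at-most-one encoding, which may differ in detail from the clauses shown in Figure 3; the names `amo_sinz`, `classify`, and `exists_aux` are hypothetical. It classifies variables by polarity and then checks by brute force that existentially quantifying the auxiliary variables leaves exactly the AMO constraint. It does not model the schedule or early quantification.

```python
from itertools import product

def amo_sinz(xs, first_aux):
    """Sequential-counter AMO over xs (len >= 2) with auxiliaries s_1..s_{n-1}."""
    n = len(xs)
    s = [first_aux + i for i in range(n - 1)]
    clauses = []
    for i in range(n - 1):
        clauses.append([-xs[i], s[i]])            # x_i implies s_i
        if i > 0:
            clauses.append([-s[i - 1], s[i]])     # s_{i-1} implies s_i
            clauses.append([-xs[i], -s[i - 1]])   # x_i excludes s_{i-1}
    clauses.append([-xs[n - 1], -s[n - 2]])       # x_n excludes s_{n-1}
    return clauses, s

def classify(clauses):
    """Split variables of binary clauses into (only negative, mixed polarity)."""
    pos, neg = set(), set()
    for clause in clauses:
        if len(clause) == 2:
            for lit in clause:
                (pos if lit > 0 else neg).add(abs(lit))
    return sorted(neg - pos), sorted(pos & neg)

def exists_aux(clauses, xs, aux, x_vals):
    """True if some auxiliary assignment satisfies all clauses for these x values."""
    for s_vals in product((0, 1), repeat=len(aux)):
        a = dict(zip(xs, x_vals))
        a.update(zip(aux, s_vals))
        if all(any((a[abs(lit)] == 1) == (lit > 0) for lit in c) for c in clauses):
            return True
    return False

xs = [1, 2, 3, 4]
clauses, aux = amo_sinz(xs, first_aux=5)
constraint_vars, auxiliary_vars = classify(clauses)
```

Here the variables that occur only negatively are the four constraint variables, and the three mixed-polarity variables are the auxiliaries.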

The heuristics used for identifying auxiliary variables and partitioning the clauses 
into distinct constraints apply to a wide range of AMO constraints, including those us- 
ing hierarchical encodings [16,36] and those considered in other constraint extraction 
programs [9]. Our method can be overly optimistic, labeling some subsets of clauses 
incorrectly. Fortunately, any such error will be quickly identified when the solver at- 
tempts to prove that the BDD generated by conjuncting the clauses and quantifying the 
auxiliary variables implies the BDD generated for the constraint. 


3.2 Solver Operation 


The SAT solver portion of PGPBS can generate BDD representations of input clauses 
and perform conjunction and existential quantification operations on BDDs [11, 12]. 
As the solver manipulates BDDs to track the solution state, it also generates clauses 
according to resolution and extension proof rules. The state of the solver at any time is 
captured by a set of terms $T_1, T_2, \ldots, T_n$, where each term $T_i$ consists of: 


A root node $u_i$ in the BDD. 

The extension variable associated with this node, also written as $u_i$. 

A unit clause, included in the proof clauses, consisting of extension variable $u_i$, 
asserting that the Boolean function represented by BDD node $u_i$ evaluates to true 
for any variable assignment that satisfies the input clauses. 

Implicitly, the set $\theta(u_i)$ of all defining clauses that were added to the proof when 
introducing the extension variables for the nodes in the BDD subgraph having root 
$u_i$. These provide the semantic model for the BDD within the proof framework. 


The BDD package supports proof-generating BDD operations APPLYAND, used to 
perform conjunction, and PROVEIMPLICATION, used to generate proofs of implication. 
The APPLYAND operation takes as arguments BDD roots u and v, and it generates 
a BDD representation with root w of their conjunction. It also generates a proof of 
the clause $\bar{u} \lor \bar{v} \lor w$, proving the implication $u \land v \rightarrow w$. The PROVEIMPLICATION 
operation performs implication testing without generating any new BDD nodes. It takes 
as arguments BDD roots $u$ and $v$, and it generates a proof of the clause $\bar{u} \lor v$, proving 
that $u \rightarrow v$. An error is signaled if the implication does not hold. 

When the solver encounters a clause command in the schedule file, it generates a 
term $T_i$ for each of the specified input clauses $C_i$ and pushes the term onto a stack. It 
also generates the proof $\theta(u_i), C_i \vdash u_i$, i.e., that the function represented by BDD node $u_i$ 
will evaluate to true for any variable assignment that satisfies the clause. 


When the solver encounters a conjunction or quantification command, it creates a 
new term by performing the specified operation and proving that it is implied by earlier 
terms. Given newly generated BDD root $u_{n+1}$, it must prove that $u_{n+1}$ is implication 
redundant with respect to the existing terms. That is, if $u_{n+1}$ was generated by applying 
some operation to terms $T_{i_1}, T_{i_2}, \ldots, T_{i_k}$, then it must generate a proof of the clause 
$\bar{u}_{i_1} \lor \bar{u}_{i_2} \lor \cdots \lor \bar{u}_{i_k} \lor u_{n+1}$. This clause can then be resolved with the unit clauses 
associated with the existing terms to yield the unit clause $u_{n+1}$, allowing a new term 
$T_{n+1}$ to be added. If some step generates a term $T_{n+1}$ with BDD representation 
$u_{n+1} = \bot$, it will also generate the empty clause, completing a proof of unsatisfiability. 
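The final resolution chain, from the implication clause and the existing unit clauses to the new unit clause, can be sketched directly. In this illustration extension variables are plain positive integers and clauses are frozensets of signed integers; both choices, and the names `resolve` and `justify_new_term`, are assumptions made here rather than details of PGBDD.

```python
def resolve(c1, c2, var):
    """Resolve c1 (containing var) with c2 (containing -var) on var."""
    assert var in c1 and -var in c2
    return (c1 - {var}) | (c2 - {-var})

def justify_new_term(existing_roots, new_root):
    """Derive the unit clause for new_root.

    Starts from the clause (-u_1 v ... v -u_k v u_new) and resolves it with
    each unit clause (u_j) in turn.
    """
    clause = frozenset([-u for u in existing_roots] + [new_root])
    for u in existing_roots:
        clause = resolve(frozenset([u]), clause, u)
    return clause

unit = justify_new_term([11, 12, 13], 14)
```

Each resolution step removes one negated literal, so after as many steps as there are contributing terms only the new extension variable remains.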


The PB solver portion of PGPBS can generate BDD representations of the inter- 
mediate constraints it creates. The SAT solver generates a new term for each of these 
BDDs. The proof generator need not have any understanding of the operation of the PB 
solver, and vice-versa. Suppose some set of input clauses encodes a pseudo-Boolean 
constraint, possibly using auxiliary variables, as was illustrated in Figure 3. The SAT 
solver performs the series of conjunction and quantification operations specified by the 
schedule to reduce the clauses to a single term $T_n$ consisting of BDD root $u_n$ and unit 
clause $u_n$. The auxiliary variables have been quantified away, and so $u_n$ depends only 
on the constraint variables. It passes the constraint to the PB solver, which generates 
its BDD representation with root $u_{n+1}$. The SAT solver uses the PROVEIMPLICATION 
operation to generate the clause $\bar{u}_n \lor u_{n+1}$. This can be resolved with unit clause $u_n$ 
to generate the unit clause $u_{n+1}$, and so the BDD representation of the constraint becomes 
term $T_{n+1}$. (Typically, the two BDDs are identical and so the implication holds 
trivially.) This process is repeated to convert the input formula into a set of pseudo- 
Boolean constraints, each represented as a term in the SAT solver. 


Once the SAT solver has converted all of the input clauses into constraints, it passes 
control to the PB solver. From that point on, the SAT solver serves in a support role, 
generating proofs to justify the steps of the PB solver. As the PB solver operates, it generates 
a BDD representation of each new constraint: for each equation $e'_i$ generated by 
Gaussian elimination (1) or each ordering constraint $c_{i,i'}$ generated by Fourier-Motzkin 
elimination (2). For a new BDD with root $u_{n+1}$ generated from constraints represented 
by terms $T_i$ and $T_j$, it uses the APPLYAND operation to generate the conjunction $w$ 
of the BDDs with roots $u_i$ and $u_j$, as well as a proof of the clause $\bar{u}_i \lor \bar{u}_j \lor w$. It 
then uses the PROVEIMPLICATION operation with arguments $w$ and $u_{n+1}$ to generate 
a proof of the clause $\bar{w} \lor u_{n+1}$. It can then resolve the unit clauses for terms $T_i$ and $T_j$ 
with the generated clauses to generate a proof of the unit clause $u_{n+1}$, and so the BDD 
representation of the constraint becomes term $T_{n+1}$. When some step of the PB solver 
generates an unsolvable equation or ordering constraint, it encodes the constraint as the 
false BDD leaf $\bot$, and the SAT solver will generate the empty clause. 


As an optimization, we implemented an operation APPLYANDPROVEIMPLICATION 
combining the functions of APPLYAND and PROVEIMPLICATION. It takes as arguments 
BDD roots $u$, $v$, and $w$ and generates a proof that $u \land v \rightarrow w$ without constructing the 
BDD representation of $u \land v$. We found this reduced the total proof lengths by over 2×. 

Fig. 4. Total number of clauses in proofs of two sets of Urquhart formulas. 


4 Experimental Results 


PGPBS is written in Python with its own BDD package and pseudo-Boolean constraint 
solver. The Gaussian elimination solver employs a standard greedy pivot selection 
heuristic, attributed to Markowitz [23,41], that seeks to minimize the number of non-zero 
coefficients created. The Fourier-Motzkin solver uses a similar heuristic for selecting 
pivot variables. 

The operation of PGPBS follows the flow illustrated in Figure 2, with constraints 
extracted directly from the input CNF file, and with the generated schedule driving the 
operation of the solver. Some measurements were taken using a BDD variable ordering 
according to their numbering in the input file, while others used a random BDD variable 
ordering to assess the sensitivity to the variable ordering. All generated proofs were 
checked with an LRAT proof checker [20]. We used KISSAT, winner of the 2020 SAT 
competition [7], as a representative CDCL solver. All measurements labeled “PGBDD” 
are for the earlier version of the solver, without pseudo-Boolean reasoning [11, 12]. 

We measure the performance of the solvers in terms of the total number of clauses 
in the generated proofs of unsatisfiability. This metric tracks closely with the solver 
runtime and has the advantage that it is machine independent. We set an upper limit of 
100 million clauses for the proof sizes for the three measured solvers. 


4.1 Urquhart Parity Formulas 


Urquhart [49] defined a family of formulas that require resolution proofs of exponential
size. Over the years, two sets of SAT benchmarks have been labeled as “Urquhart Problems”
[15,38]. The formulas are defined over a class of degree-5, undirected, bipartite
graphs, parameterized by a size $m$, with the graph having $2m^2$ nodes. To transform a
graph into a formula, each edge $\{i, j\}$ in the set of edges $E$ has an associated variable
$x_{\{i,j\}}$. (We use set notation to emphasize that the order of the indices does not matter.)
Each vertex $i$ is assigned a polarity $p_i \in \{0,1\}$, such that the sum of the polarities is odd.
The clauses then encode that the sum for all values of $i$ and $j$ of $x_{\{i,j\}} + p_i$ equals 0
modulo 2. This is false of course, since each edge is counted twice in the sum, and the
sum of the polarities is odd.


3 PGPBS, PGBDD, and the code for generating and testing a set of benchmarks, are available at
https://github.com/rebryant/pgpbs-artifact and as https://doi.org/10.5281/zenodo.5907086.
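
The counting argument can be checked by hand on a very small instance. The following is a minimal sketch only: it uses a hypothetical 4-cycle rather than one of the degree-5 bipartite graphs of the benchmarks, and the names `edges`, `polarity`, and `vertex_parity_ok` are illustrative, not taken from PGPBS.

```python
# Illustrative sketch only: a tiny hand-picked graph, not one of the
# degree-5 bipartite graphs used in the benchmarks.
from itertools import product

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical 4-cycle
polarity = {0: 1, 1: 0, 2: 0, 3: 0}        # sum of polarities is odd

def vertex_parity_ok(assignment, v):
    """Constraint for vertex v: incident edge variables + p_v = 0 (mod 2)."""
    total = polarity[v]
    for e, value in zip(edges, assignment):
        if v in e:
            total += value
    return total % 2 == 0

# Exhaustively confirm that no edge assignment satisfies every vertex.
satisfiable = any(
    all(vertex_parity_ok(a, v) for v in polarity)
    for a in product([0, 1], repeat=len(edges))
)

# Summing all vertex constraints: each edge contributes twice (even),
# so the total reduces to the sum of the polarities modulo 2.
edge_contribution = (2 * len(edges)) % 2
polarity_sum = sum(polarity.values()) % 2
```

Here `satisfiable` is `False`, while `edge_contribution` is 0 and `polarity_sum` is 1, which is the contradiction 0 = 1 (mod 2) described above.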

The two families of benchmarks differ in how the graphs are constructed. Li’s
benchmarks are based on the explicit construction of expander graphs [26, 40], upon
which Urquhart’s lower bound proof is based. Simon’s benchmarks are based on randomly
generated graphs and thus depend on the random seed. We generated five different
formulas for each value of m. Simon’s graphs are not guaranteed to satisfy the
expander property, but they still provide challenging benchmarks for SAT solvers.

Figure 4 shows the performance of the solvers, measured as the number of clauses 
as a function of m, for both Simon’s and Li’s benchmarks. The smallest instances of 
the benchmark have m = 3. As can be seen, KISSAT is able to generate proofs for the
Simon version for four cases with m = 3 and one with m = 4, but it is unable to
handle any other cases, not even the minimum instance for Li’s benchmark.
Measurements are shown for PGBDD running bucket elimination, a simple algorithm 
that processes clauses and intermediate terms with conjunction and quantification oper- 
ations according to the levels of the topmost variables [22,35]. It achieves polynomial 
scaling on both benchmarks, with only mild sensitivity to the random seeds. Running 
PGPBS with modulo-2 equation solving improves the performance even further, such 
that we were able to handle both families of benchmarks up to m = 48. Considering 
that the problem grows quadratically in m, this represents a major improvement over 
KISSAT. 


4.2 Other Parity Constraint Benchmarks 


Chew and Heule [17] introduced a benchmark based on Boolean expressions computing
the parity of a set of Boolean values $x_1, \ldots, x_n$ using two different orderings of the
inputs, with a randomly chosen variable negated in the second computation. The SAT
problem is to find a satisfying assignment that makes the two expressions yield the same 
result—an impossibility due to the negated variable. With KISSAT, we found the results 
were very sensitive to the choice of random permutation, and so we ran the solver for 
five different random seeds for each value of n. We were able to generate proofs for 
instances with n up to 47, but we also encountered cases where the proofs exceeded the 
100-million clause limit starting with n = 40. The overall scaling is exponential. 

Chew and Heule showed they could generate proofs for this problem that scale as 
n log n. Using bucket elimination, PGBDD is able to obtain polynomial performance, 
handling up to n = 3,000 with a proof of 61 million clauses. PGPBS is able to apply 
Gaussian elimination with modulus r = 2, obtaining even better performance than did 
Chew and Heule. For n = 10,000, Chew and Heule’s proof has 14 million clauses while 
the proof generated by PGPBS has less than 7 million.


[Figure 5: log-log plot titled “Mutilated Chessboard Clauses”, showing clauses versus n.
Series: KISSAT; PGPBS, Integer Equations, Input Order; PGBDD, Column Scan, Input Order;
PGPBS, Mod-3 Equations, Input Order.]

Fig. 5. Total number of clauses in proofs of n × n mutilated chess board problems.


Elffers and Nordström created the TSEITINGRID family of benchmarks for the 2016 
SAT competition, based on grid graphs having fixed width but variable lengths [25]. 
These are designed to be challenging for SAT solvers while having polynomial scaling. 
The 2020 SAT competition included two instances of this benchmark, with 7 × 165 and
7 × 185 grids. None of the entrants could generate an unsatisfiability proof for either
instance within the 5000-second time limit. On the other hand, PGPBS can readily
handle both, generating proofs with fewer than 500,000 clauses and requiring at most 63
seconds. Indeed, PGPBS can solve the largest published instance, having a 7 × 200 grid,
in 76 seconds. Clearly, parity constraint problems pose no major challenge for PGPBS. 


4.3 Variants of the Mutilated Chessboard 


The mutilated chessboard problem considers an n × n chessboard, with the corners on
the upper left and the lower right removed. It attempts to tile the board with dominos, 
with each domino covering two squares. Since the two removed squares had the same 
color, and each domino covers one white and one black square, no tiling is possible. 
This problem has been well studied in the context of resolution proofs, for which it can 
be shown that any proof must be of exponential size [2]. 

The standard CNF encoding defines a Boolean variable for each possible horizontal
or vertical domino placement. For each square, it encodes an exactly-one constraint
for the set of dominos that could cover that square. Both the number of variables and
the number of clauses scale as $O(n^2)$. Figure 5 shows the performance of the different
solvers as a function of n. KISSAT scales exponentially, hitting the 100-million clause
limit with n = 20. The plot labeled “Column Scan” demonstrates that PGBDD performs
very well on this problem when given a carefully crafted schedule and the proper
variable ordering [11], requiring less than 20 million clauses for n = 128.


[Figure 6: log-log plot titled “Mutilated Chess Board/Torus Clauses”, showing clauses versus n.
Series: Board, PGBDD, Column Scan, Random Order; Torus, PGBDD, Column Scan, Input Order;
Torus, PGPBS, Autodetect, Random Order; Board, PGPBS, Autodetect, Random Order.]

Fig. 6. Stress Testing: Changing the topology and variable ordering for mutilated chess.
Autodetection enables the PB solver to use modulo-3 arithmetic.


The plot labeled “Integer Equations, Input Ordering” shows that PGPBS can achieve 
polynomial scaling on this problem when performing Gaussian elimination on integer 
equations. It does not scale as well as column scanning, reaching n = 96 before hitting 
the clause limit. (The unevenness of the plot appears to be an artifact of the randomiza- 
tion used to break ties during pivot selection.) 

Looking deeper, we can see that the solver avoids the worst-case performance for Gaussian
elimination on this problem. Let us assume that the omitted corners are both white,
and so the board has $k$ black squares and $k - 2$ white squares, where $k = n^2/2$. Each
variable occurs in one equation for a black square and in one for a white square. If we
were to sum all of the equations for the black squares, we would get $\sum_{j=1}^{m} x_j = k$,
where $m$ is the number of variables. Similarly, summing the equations for the white
squares gives $\sum_{j=1}^{m} x_j = k - 2$. Subtracting the second equation from the first gives
the unsolvable equation $0 = 2$. These sums and differences can be performed using
pseudo-Boolean equations with coefficients 0 and +1. Although Gaussian elimination
combines equations in a different order, it maintains the property that the coefficients
are limited to values −1, 0, and +1.
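
The summation argument can be reproduced directly. The sketch below is illustrative and is not the PGPBS implementation: it builds the exactly-one equations for an even n, adds those for the squares of one color, subtracts those for the other, and returns the resulting coefficients and right-hand side. The function names and the representation of a domino variable as a pair of squares are assumptions made for this example.

```python
from collections import defaultdict

def board_cover_sets(n):
    """Squares of an n x n board without its upper-left and lower-right
    corners, and for each square the set of dominos that can cover it."""
    removed = {(0, 0), (n - 1, n - 1)}
    squares = [(r, c) for r in range(n) for c in range(n)
               if (r, c) not in removed]
    present = set(squares)
    cover = defaultdict(set)
    for (r, c) in squares:
        for dr, dc in ((0, 1), (1, 0)):        # horizontal, vertical
            other = (r + dr, c + dc)
            if other in present:
                domino = ((r, c), other)       # one variable per placement
                cover[(r, c)].add(domino)
                cover[other].add(domino)
    return squares, cover

def black_minus_white(n):
    """Sum the exactly-one equations of the squares whose color differs
    from the removed corners, minus those of the other color."""
    squares, cover = board_cover_sets(n)
    coeff = defaultdict(int)
    rhs = 0
    for (r, c) in squares:
        sign = 1 if (r + c) % 2 == 1 else -1   # removed corners have even r + c
        for domino in cover[(r, c)]:
            coeff[domino] += sign
        rhs += sign
    return dict(coeff), rhs

coeff, rhs = black_minus_white(8)
```

For n = 8 every coefficient in `coeff` is 0 and `rhs` is 2, that is, the combined equation reads 0 = 2.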

The plot labeled “Mod-3 Equations, Input Ordering” demonstrates the benefit of 
modular arithmetic when solving systems of equations. The equation 0 = 2, obtained 
by integer Gaussian elimination for this problem, has no solution for any odd modulus; 
modulus auto-detection chooses r = 3. This optimization achieves better scaling, due 
to the bounded width of the BDD representations. Indeed, it outperforms the best results 
obtained with PGBDD, generating a proof with less than 8 million clauses for n = 128. 
For the remaining measurements, we assume that modulus auto-detection is enabled. 

The plots of Figure 6 illustrate how pseudo-Boolean reasoning makes PGPBS more
robust than PGBDD. First, we consider the extension of the mutilated chessboard problem
to a torus, with the sides of the board wrapping around both vertically and horizontally.
As the plot labeled “Torus, PGBDD, Column Scan, Input Order” indicates,
the performance of column scanning disintegrates for this seemingly minor change.
The compact state encoding exploited by column scanning works only when there is
a single frontier as the variables are processed from left to right. Second, the plot labeled
“Board, PGBDD, Column Scan, Random Order” illustrates that column scanning
is highly sensitive to the chosen BDD variable ordering. On the other hand, the four
versions using auto-detected modular equations are only mildly sensitive to the topology
(torus or board) or the variable ordering (input or random). For both topologies, the
clause counts for the two different orderings (input and random) are so close to each
other that they cannot be distinguished on the log-log scale, and so we show only the
results for random orderings. These results show that pseudo-Boolean reasoning overcomes
several major weaknesses of the pure Boolean methods of PGBDD. With its PB
solver, PGPBS requires no guidance from the user regarding how to process the clauses,
nor does it require any guidance or heuristics to choose a good BDD variable ordering.
Furthermore, it is less sensitive to the problem definition.


456 R. E. Bryant, A. Biere, and M. J. H. Heule


[Figure 7: plot titled “Pigeonhole Clauses”.]

Fig. 7. Total number of clauses in proofs of pigeonhole problem for n holes


4.4 Pigeonhole Problem 


The pigeonhole problem is one of the most studied problems in propositional reasoning. 
Given a set of n holes and a set of n+1 pigeons, it asks whether there is an assignment of 
pigeons to holes such that (1) every pigeon is in some hole, and (2) every hole contains 
at most one pigeon. The answer is no, of course, but any resolution proof for this must 
be of exponential length [29].

The problem can be encoded into CNF with Boolean variables $p_{i,j}$, for $1 \le i \le n$
and $1 \le j \le n + 1$, indicating that pigeon $j$ is placed in hole $i$. A set of n AMO
constraints indicates that each hole can contain at most one pigeon, and n + 1 ALO
constraints indicate that each pigeon must be placed in some hole. We experimented
with two different encodings for the AMO constraints: the direct encoding requiring
n(n + 1)/2 clauses per hole, and the Sinz encoding [45], requiring 3n − 1 clauses.
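
As an illustration of the direct encoding, the following sketch generates the clauses for n holes in DIMACS-style integer literals. The variable numbering in `var` is an assumption made for this example and need not match the benchmark generator distributed with PGPBS.

```python
from itertools import combinations

def pigeonhole_direct(n):
    """CNF clauses for n holes and n + 1 pigeons with the direct AMO
    encoding. Literals are nonzero integers in DIMACS style."""
    def var(i, j):
        # p_{i,j}: pigeon j (1..n+1) is placed in hole i (1..n)
        return (i - 1) * (n + 1) + j

    clauses = []
    # ALO: each pigeon j is in at least one hole.
    for j in range(1, n + 2):
        clauses.append([var(i, j) for i in range(1, n + 1)])
    # AMO: no hole i holds two different pigeons.
    for i in range(1, n + 1):
        for j1, j2 in combinations(range(1, n + 2), 2):
            clauses.append([-var(i, j1), -var(i, j2)])
    return clauses

clauses = pigeonhole_direct(4)
```

For n = 4 this yields 5 ALO clauses and 4 · 10 = 40 binary AMO clauses, matching the n(n + 1)/2 clauses per hole stated above.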


Figure 7 shows the total number of clauses (input plus proof) as functions of n 
for this problem. KISSAT performs poorly, reaching the 100-million clause limit with 
n = 14 for the direct encoding and n = 15 for the Sinz encoding. Using PGBDD, we 
were unable to find any strategy that gets beyond n = 16 with a direct encoding. Our 
best results came from a “tree” strategy, simply forming the conjunction of the input 
clauses using a balanced tree of binary operations. For the Sinz encoding, on the other 
hand, we devised a column scanning technique similar to the method used to solve the 
mutilated chessboard problem. This approach scales very well, empirically measured
as $O(n^3)$. The proofs stay below 100 million clauses up to n = 128, although it can
only reach n = 17 with a random variable ordering (plot not shown).


Using pseudo-Boolean reasoning with Fourier-Motzkin elimination, we were able
to achieve polynomial scaling, reaching n = 34 with both encodings and for both input
and random ordering. The four results are so similar that they are indistinguishable on a
log-log plot, and so we show the average for the two encodings with random orderings.
Observe that each variable $p_{i,j}$ occurs with coefficient −1 in the AMO constraint for
hole $i$ and with coefficient +1 in the ALO constraint for pigeon $j$. Thus, as described in
Section 2.3, each step of Fourier-Motzkin elimination reduces the number of constraints
by at least one, with the coefficients restricted to the values −1, 0, and +1. Indeed, it
can be seen that the solver, in effect, sums the n AMO and n + 1 ALO constraints to
get the unsolvable constraint $0 \ge 1$. The scaling of proof sizes, empirically measured
as $O(n^5)$, is limited by the $O(n^2)$ growth of the BDD representations for the ordering
constraints, as was illustrated in Figure 1C.
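
The net effect of the elimination can be verified with a few lines of arithmetic. This sketch does not perform Fourier-Motzkin elimination itself; it only adds up the constraints, each written in the form "sum of coeff · p[i, j] ≥ bound", to confirm the final result. The function name and the dictionary representation are assumptions for illustration.

```python
from collections import Counter

def sum_pigeonhole_constraints(n):
    """Add up the n AMO and n + 1 ALO constraints and return the
    resulting (coefficients, bound)."""
    coeff = Counter()
    bound = 0
    # AMO for hole i:  -sum_j p[i, j] >= -1
    for i in range(1, n + 1):
        for j in range(1, n + 2):
            coeff[(i, j)] -= 1
        bound -= 1
    # ALO for pigeon j:  sum_i p[i, j] >= 1
    for j in range(1, n + 2):
        for i in range(1, n + 1):
            coeff[(i, j)] += 1
        bound += 1
    return coeff, bound

coeff, bound = sum_pigeonhole_constraints(5)
```

Every coefficient cancels and the bound is −n + (n + 1) = 1, so the sum is the constraint 0 ≥ 1.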


The plot labeled “Sinz, PGPBS, Equations, Random Order” demonstrates the effect 
of adding constraints to enforce exactly-one constraints on both the pigeons and the 
holes. The solver applies modulus auto-detection to give a modulus of r = 2. Modulo- 
2 reasoning enables the solver to match the performance of column scanning, with 
the further advantages of being fully automated and being insensitive to the variable 
ordering. However, it requires additional constraints in the input file. 


Finally, the plot labeled “Direct, Cook’s Proof” shows the complexity of Cook’s
extended-resolution proof of the pigeonhole problem [19], encoded in DRAT format.
Although it is very concise for small values of n, its scaling as $\Theta(n^4)$ lies between the
$O(n^3)$ achieved by column scanning and equation solving, and the $\Theta(n^5)$ achieved by
constraint solving. Of these, only Cook’s proof and the solution by constraint solving
are directly comparable, in that only these use a direct encoding and have only the
minimum set of AMO and ALO constraints.


In summary, pseudo-Boolean reasoning makes this problem tractable with full automation,
and it has minimal sensitivity to the variable ordering. Generating proofs by
solving systems of ordering constraints is more challenging than by solving automatically
detected modular equations, but both achieve polynomial scaling.


4.5 Other Cardinality Constraint Problems


Codel et al. [18] defined a general class of problems that includes the mutilated chessboard
and the pigeonhole problems as special cases. Given a bipartite graph with vertices
$L$ and $R$ such that $|L| < |R|$, the problem is to find a perfect matching, i.e., a
subset of the edges such that each vertex has exactly one incident edge. For the mutilated
chessboard, $L$ and $R$ correspond to the white and black squares, respectively, with
edges based on chessboard adjacencies. For pigeonhole, $L$ corresponds to the holes and
$R$ to the pigeons, and the graph is the complete bipartite graph $K_{n,n+1}$. No instance of
this matching problem has a solution, since the sets of nodes are of unequal size.

Twelve instances of this problem were included in the 2021 SAT competition, based
on randomly generated graphs with $n = |L|$ ranging from 15 to 20 and with $|R| = n + 1$.
Different methods were used to encode the AMO constraints, and some included
clauses to convert both sets of constraints into exactly-one constraints. In the competition,
all of the solvers could easily handle the benchmarks with n = 15, most could
handle n = 16, with typical runtimes of around 1000 seconds, but none could solve
any of the larger problems. PGPBS can easily handle all of the benchmarks, requiring
at most 13 seconds and generating proofs with less than 500,000 clauses.


5 Conclusions 


Incorporating pseudo-Boolean reasoning into a SAT solver enables it to handle classes 
of problems encoded in CNF that are intractable for CDCL solvers. By having the PB 
solver generate BDD representations of its intermediate results, a BDD-based, proof- 
generating SAT solver can generate clausal proofs of unsatisfiability on behalf of the PB 
solver in the standard, DRAT proof framework. Compared to the SAT solver operating 
on its own, including a PB solver enables greater automation with less sensitivity to 
problem definition, encoding method, and variable ordering. 

We have shown that applying pseudo-Boolean reasoning to unsatisfiable instances 
of parity and cardinality constraint problems can yield proofs that scale polynomially. 
Solving systems of equations over the integers modulo 2 yields 0-1 valued solutions, 
and so parity reasoning can also be used on satisfiable problems [6, 30, 37, 47]. On 
the other hand, Gaussian elimination over integers or with modulus r > 2, as well 
as Fourier-Motzkin elimination, are not guaranteed to find 0-1 valued solutions. When 
seeking solutions with cardinality reasoning, it seems more effective to use methods 
that adapt CDCL-based search to pseudo-Boolean constraints [14]. 

The method described here can be generalized to incorporate other reasoning methods
into a proof-generating SAT solver. As long as intermediate results can be expressed
as BDDs, a proof can be generated that the result of each step logically follows from the
preceding steps. Thus, we could incorporate other pseudo-Boolean reasoning methods,
such as cutting planes [28,32], or we could add totally different reasoning methods.


Abstract. Augmenting problem variables in a quantified Boolean formula with
definition variables enables a compact representation in clausal form. Generally
these definition variables are placed in the innermost quantifier level. To restore
some structural information, we introduce a preprocessing technique that
moves definition variables to the quantifier level closest to the variables that define
them. We express the movement in the QRAT proof system to allow verification
by independent proof checkers. We evaluated definition variable movement
on the QBFEVAL’20 competition benchmarks. Movement significantly improved
performance for the competition’s top solvers. Combining variable movement
with the preprocessor BLOQQER improves solver performance compared to using
BLOQQER alone.


1 Introduction 


Boolean formulas and circuits can be translated into conjunctive normal form (CNF) by 
introducing definition variables to augment the existing problem variables. Definition 
variables are introduced through a set of defining clauses, given by the Tseitin [19] or 
Plaisted-Greenbaum [16] transformation. Problem variables occurring in the defining 
clauses constitute the defining variables; they effectively determine the values of the 
definition variables. In CNF, definitions are not an explicit part of the problem repre- 
sentation, preventing solvers from using this structural information. Quantified Boolean 
formulas (QBF) extend CNF into prenex conjunctive normal form (PCNF) with the ad- 
dition of quantifier levels. In practice, definition variables are usually placed in the 
innermost quantifier level. However, as we will show, placing a definition variable in 
the quantifier level immediately following its defining variables can improve solver per- 
formance. 

We describe a preprocessing technique for moving definition variables to the quantifier
level of their innermost defining variables. As a starting point, existing tools KISSAT
and CNFTOOLS can detect definitions in a CNF formula. We process and order the candidate
definitions, moving definition variables sequentially. For each instance of movement
we generate a proof in the QRAT proof system that, through a series of clause
additions and deletions, effectively replaces the old definition variable with a new variable
at the desired quantification level.

Most Boolean satisfiability (SAT) solvers generate proofs of unsatisfiability for independent
checking [7,9,20]. This has proved valuable for verifying solutions independent
of the (potentially buggy) solvers. Proof generation is difficult for QBF and
relatively uncommon in solvers. The QBF preprocessor BLOQQER [2] generates QRAT
proofs [8] for all of the transformations it performs. Our QRAT proofs for variable
movement also allow verification with the independent proof checker QRAT-TRIM,
ensuring that the movement preserves equivalence with the original formula.

Clausal-based QBF solvers rely on preprocessing to improve performance. Almost
every top-tier solver in the QBFEVAL’20 competition¹ used some combination of BLOQQER,
HQSPRE [21], or QBFRELAY [15]. Some solvers incorporate preprocessing
techniques into the solving phase, e.g., DEPQBF’s [14] use of dynamic quantified
blocked clause elimination. Unlike other preprocessing techniques, variable movement
does not add or remove clauses or literals. However, it can prompt the removal of literals
through universal reduction and may guide solver decisions in a beneficial way.

The contributions of this paper include: (1) adapting the SAT solver KISSAT and
CNF preprocessor CNFTOOLS to detect definitions in a QBF, (2) giving an algorithm
for moving variables that maximizes variable movement, (3) formulating steps for generating
a QRAT proof of variable movement, and (4) evaluating the impact of these transformations.
Variable movement significantly improves the performance of top solvers
from the QBFEVAL’20 competition. Combining variable movement with BLOQQER
further improves solver performance.


2 Preliminaries 


2.1 Quantified Boolean Formulas 


Quantified Boolean formulas (QBF) can be represented in prenex conjunctive normal
form (PCNF) as $\Pi.\psi$, where $\Pi$ is a prefix of the form $Q_1X_1 Q_2X_2 \cdots Q_nX_n$ for
$Q_i \in \{\forall, \exists\}$ and the matrix $\psi$ is a CNF formula. The formula $\psi$ is a conjunction of
clauses, where each clause is a disjunction of literals. A literal $l$ is either a variable
$l = x$ or negated variable $l = \bar{x}$, and $\mathrm{Var}(l) = x$. The formula $\psi(l)$ is the set of clauses
$\{C \mid C \in \psi, l \in C\}$. The set of all variables occurring in a formula is given by
$\mathrm{Var}(\psi)$. Substituting a variable $y$ for $x$ in $\psi$, denoted as $\psi[y/x]$, will replace every
instance of $x$ with $y$ and $\bar{x}$ with $\bar{y}$ in the formula. The sets of variables $X_i$ are disjoint,
and we assume every variable occurring in $\psi$ is in some $X_i$. A variable $x$ is fresh if it
does not occur in $\Pi.\psi$. The quantifier for literal $l$ with $\mathrm{Var}(l) \in X_i$ is $Q(\Pi, l) = Q_i$,
and $l$ is said to be in quantifier level $\lambda(l) = i$. If $Q(\Pi, l) = Q_i$ and $Q(\Pi, k) = Q_j$,
then $l \le_\Pi k$ if $i \le j$. $Q_1X_1$ is referred to as the outermost quantifier level and $Q_nX_n$
is the innermost quantifier level.


2.2 Inference Techniques in QBF 


Given a clause $C$, if a literal $l \in C$ is universally quantified, and all existentially
quantified literals $k \in C$ satisfy $k \le_\Pi l$, then $l$ can be removed from $C$. This process is
called universal reduction (UR). Given two clauses $C$ and $D$ with $x \in C$ and $\bar{x} \in D$,
the Q-resolvent over pivot variable $x$ is $\mathrm{UR}(C) \cup \mathrm{UR}(D) \setminus \{x, \bar{x}\}$ [12]. The operation
is undefined if the result is tautological. This extends resolution for propositional logic
by applying UR to the clauses before combining them, while disallowing tautologies.
Adding or removing non-tautological Q-resolvents preserves logical equivalence.


¹ Available at http://www.qbflib.org/qbfeval20.php

Given a prefix $\Pi$ and clauses $C$ and $D$ with $l \in C$ and $\bar{l} \in D$, the outer resolvent
over existentially quantified pivot literal $l$ is $C \cup \{k \mid k \in D, k \neq \bar{l}, k \le_\Pi l\}$. Given
a QBF $\Pi.\psi$, a clause $C$ is Q-blocked on some existentially quantified literal $l \in C$ if
for all $D \in \psi(\bar{l})$ the outer resolvent of $C$ with $D$ on $l$ is a tautology. This extends the
blocked property for CNF with the restriction on the conflicting literal’s quantifier level.
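
The following sketch illustrates universal reduction, outer resolvents, and the Q-blocked check on a small example. The data representation is an assumption made for this illustration (literals as nonzero integers, the prefix as a list of quantifier blocks); it is not how KISSAT, CNFTOOLS, or QRAT-TRIM represent formulas, and the comparison of quantifier levels is taken to be non-strict in the outer resolvent.

```python
# Hypothetical representation: literals are nonzero ints, a clause is a
# frozenset of literals, and the prefix is a list of (quantifier, variables)
# pairs from outermost to innermost, with 'A' = forall and 'E' = exists.

def make_level_maps(prefix):
    level, quant = {}, {}
    for idx, (q, variables) in enumerate(prefix, start=1):
        for v in variables:
            level[v], quant[v] = idx, q
    return level, quant

def universal_reduction(clause, level, quant):
    """Drop each universal literal that lies deeper than every
    existential literal of the clause."""
    exist_levels = [level[abs(k)] for k in clause if quant[abs(k)] == 'E']
    deepest = max(exist_levels, default=0)
    return frozenset(l for l in clause
                     if quant[abs(l)] == 'E' or level[abs(l)] < deepest)

def outer_resolvent(c, d, lit, level):
    """C plus the literals of D, other than the complement of lit,
    whose level is at most that of lit."""
    return c | {k for k in d
                if k != -lit and level[abs(k)] <= level[abs(lit)]}

def is_tautology(clause):
    return any(-l in clause for l in clause)

def q_blocked(c, lit, matrix, level):
    return all(is_tautology(outer_resolvent(c, d, lit, level))
               for d in matrix if -lit in d)

# Example: x is defined as a AND b.
A, B, X = 1, 2, 3
matrix = [frozenset({X, -A, -B}), frozenset({-X, A}), frozenset({-X, B})]
level_ok, quant_ok = make_level_maps([('A', {A, B}), ('E', {X})])
level_bad, quant_bad = make_level_maps([('A', {B}), ('E', {X}), ('A', {A})])
```

With `a` and `b` quantified before `x`, the clause (x ∨ ¬a ∨ ¬b) is Q-blocked on `x`. When `a` is moved to a level after `x`, the literal `a` is excluded from the outer resolvent, so the clause is no longer Q-blocked, and universal reduction removes ¬a from it.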

A clause $C$ subsumes $D$ if $C \subseteq D$. The property Q-blocked-subsumed generalizes
Q-blocked by requiring the outer resolvents be tautologies or subsumed by some clause
in the formula.

Given a QBF $\Psi = \Pi.\psi$, if a clause $C$ is Q-blocked-subsumed then $C$ is QRAT
w.r.t. $\Psi$. In this case, $C$ can be added to $\psi$ or, if $C \in \psi$, deleted from $\psi$ while preserving
equivalence. A series of clause additions and deletions resulting in the empty formula
is a satisfaction proof for a QBF if all clause deletions are QRAT. A series of clause
additions and deletions deriving the empty clause is a refutation proof for a QBF if all
clause additions are QRAT. If both clause additions and deletions are QRAT, each step
preserves equivalence regardless of the truth value of the QBF. We call this a dual proof.
The QBF $\Psi'$ that results from applying the dual proof steps to $\Psi$ is equivalent to $\Psi$.


2.3 Definitions 


A variable $x$ is a definition variable in $\Psi = \Pi.\psi$ with defining clauses $\delta(x)$ containing
$x$, $\delta(\bar{x})$ containing $\bar{x}$, and defining variables $Z_x = \mathrm{Var}(\delta(x) \cup \delta(\bar{x})) \setminus \{x\}$ when two
properties hold: (1) the definition is left-total, meaning that for every assignment of $Z_x$
there exists a value of $x$ that satisfies $\delta(x) \cup \delta(\bar{x})$, and (2) the definition is right-unique,
meaning that for every assignment of $Z_x$ there exists exactly one value of $x$ that satisfies
$\delta(x) \cup \delta(\bar{x})$. The clauses $\delta(x) \cup \delta(\bar{x})$ are left-total iff they are Q-blocked on variable
$x$. This implies that the definition variable comes after the defining variables w.r.t. $\Pi$.
The definition is right-unique if the SAT problem $\{C \setminus \{x, \bar{x}\} \mid C \in \delta(x) \cup \delta(\bar{x})\}$ is
unsatisfiable. We can assume that any right-unique variable is existentially quantified,
otherwise the formula would be trivially false.

The remaining clauses of $x$ are $\rho(x) = \psi(x) \setminus \delta(x)$ and $\rho(\bar{x}) = \psi(\bar{x}) \setminus \delta(\bar{x})$. If $x$
occurs as a single polarity in the remaining clauses, it can be encoded as a one-sided
definition: if $\rho(x)$ is empty only $\delta(x)$ are needed to determine if $x$ is assigned to true
and therefore unable to satisfy the clauses in $\rho(\bar{x})$. This is a stronger condition than
monotonicity used for the general Plaisted-Greenbaum transformation [16].


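Example 1. $x \leftrightarrow a \land b$ is written in CNF as $(x \lor \bar{a} \lor \bar{b}) \land (\bar{x} \lor a) \land (\bar{x} \lor b)$. Given
$\rho(x) = \{(x \lor c), (x \lor d \lor e)\}$ and $\rho(\bar{x}) = \{\}$, $x \leftrightarrow a \land b$ can be written as a one-sided
definition with clauses $(\bar{x} \lor a) \land (\bar{x} \lor b)$.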
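
For a definition with few defining variables, both properties can be tested by enumeration. The sketch below is illustrative only; actual tools use pattern matching or a SAT call rather than brute force. Right-uniqueness is tested here as "no assignment of the defining variables admits both values of x", which is what the unsatisfiability check on the clauses with x removed expresses. The integer variable numbering is an assumption for this example.

```python
from itertools import product

def satisfies(clauses, assignment):
    """True if every clause has a literal made true by the assignment."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

def check_definition(x, defining_clauses):
    """Brute-force test of left-totality and right-uniqueness for x.
    Returns (left_total, right_unique)."""
    z = sorted({abs(l) for c in defining_clauses for l in c} - {x})
    left_total, right_unique = True, True
    for values in product([False, True], repeat=len(z)):
        assignment = dict(zip(z, values))
        count = sum(satisfies(defining_clauses, {**assignment, x: vx})
                    for vx in (False, True))
        left_total = left_total and count >= 1
        right_unique = right_unique and count <= 1
    return left_total, right_unique

X, A, B = 1, 2, 3
full = [(X, -A, -B), (-X, A), (-X, B)]   # x <-> a AND b
one_sided = [(-X, A), (-X, B)]           # only the clauses containing NOT x

full_result = check_definition(X, full)
one_sided_result = check_definition(X, one_sided)
```

The full definition is both left-total and right-unique. The one-sided clause set on its own remains left-total but is not right-unique, since with a and b both true either value of x satisfies it.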


In some definitions including exclusive-or (XOR denoted by $\oplus$), multiple variables
are left-total and right-unique. Determining the definition variable requires information
about how definition variables are nested within the formula.

Q-resolution can be generalized to sets of clauses $C$ and $D$, denoted $C \otimes_x D$, by
generating the non-tautological resolvents from clauses in $C(x)$ and $D(\bar{x})$ on pivot
variable $x$ pairwise. Given a definition variable $x$ and defining variables $\{z_1, \ldots, z_n\}$,
let $x'$ be a fresh variable with $\delta_x = \delta(x)$ and $\delta_{x'} = \delta(\bar{x})[x'/x]$. The procedure
defining variable elimination applies set-based Q-resolution in the following way: set
$\theta_1 = \delta_x(z_1) \otimes_{z_1} \delta_{x'}(\bar{z}_1) \land \delta_x(\bar{z}_1) \otimes_{z_1} \delta_{x'}(z_1)$ and compute
$\theta_2 = \theta_1(z_2) \otimes_{z_2} \theta_1(\bar{z}_2)$; continue the process until
$\theta_n = \theta_{n-1}(z_n) \otimes_{z_n} \theta_{n-1}(\bar{z}_n)$. UR is not applied because $x$
is in the innermost quantifier level with respect to its defining variables. The first step
ensures all clauses in $\theta_1$ will contain both $x$ and $\bar{x}'$. $\theta_n$ will either be $\{(x \lor \bar{x}')\}$ or
empty. If $\theta_n = \{(x \lor \bar{x}')\}$, linearizing the sets of resolvents $\theta_i$ forms a Q-resolution
derivation of $(x \lor \bar{x}')$. This is similar to Davis-Putnam variable elimination [4].
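
A small sketch of set-based resolution and of defining variable elimination follows, under one reading of the procedure: the first step resolves the clauses containing x against the renamed clauses containing the negation of x' on z_1 in both polarities, and each later step resolves the running set with itself on the next defining variable. The helper names are hypothetical, universal reduction is omitted, and the example is an XOR definition, where every defining clause mentions every defining variable.

```python
def resolve_sets(cs, ds, pivot):
    """All non-tautological resolvents of clauses in cs containing pivot
    with clauses in ds containing -pivot."""
    out = set()
    for c in cs:
        if pivot not in c:
            continue
        for d in ds:
            if -pivot not in d:
                continue
            r = (c - {pivot}) | (d - {-pivot})
            if not any(-l in r for l in r):
                out.add(frozenset(r))
    return out

def eliminate_defining_variables(delta_x, delta_xp, zs):
    """Mix the two clause sets on z_1, then resolve the running set
    with itself on z_2, ..., z_n."""
    z1 = zs[0]
    theta = (resolve_sets(delta_x, delta_xp, z1)
             | resolve_sets(delta_xp, delta_x, z1))
    for z in zs[1:]:
        theta = resolve_sets(theta, theta, z)
    return theta

# x <-> a XOR b, with x' a fresh copy of x (integer ids are arbitrary).
X, A, B, XP = 1, 2, 3, 4
delta_x = [frozenset({X, -A, B}), frozenset({X, A, -B})]        # contain x
delta_xp = [frozenset({-XP, A, B}), frozenset({-XP, -A, -B})]   # renamed, contain NOT x'

theta = eliminate_defining_variables(delta_x, delta_xp, [A, B])
```

For this example the final set contains the single clause (x ∨ ¬x'), which states that x' implies x.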


3 Definition Detection 


Given a QBF with no additional information, we first detect definitions to determine 
which variables can be moved. All definitions are detected before variable movement 
begins. Variable movement depends on the defining clauses, the definition variables, and 
the nesting of definition variables. At a minimum, definition detection must produce the 
defining clauses, and the rest can be inferred during movement. 

Since the seminal work by Eén and Biere [5], bounded variable elimination (BVE)
has been an essential preprocessing technique in SAT solving. The technique relies on
definitions, so most SAT solvers incorporate some form of definition detection. The
conflict-driven clause learning SAT solver KISSAT [1] extends the commonly used syntactic
pattern matching with semantic definition detection. The detection is applied to
variables independently. Alternatively, the preprocessor CNFTOOLS [10] performs hierarchical
definition detection, capturing additional information about definition variable
nesting and monotonic definitions.

These tools run on CNF formulas. A QBF can be transformed into a CNF by remov- 
ing the prefix, but not all definitions in the CNF are valid w.r.t. the prefix. For example, 
some definitions will not be left-total because of the quantifier level restrictions in the 
Q-blocked property. Such definitions can be easily filtered out before variable move- 
ment, so there is no need to add these quantifier-based checks into the tools. 


3.1 Hierarchical Definition Detection in CNFTOOLS 


The hierarchical definition detection in CNFTOOLS employs a breadth first search (BFS) 
to recurse through nested definitions in a formula. Root clauses are selected heuris- 
tically, then BFS begins on the variables occurring in those clauses. All unit clauses 
are selected as root clauses. The max-var heuristic selects root variables based on their 
numbering. This exploits the practice of numbering definition variables after problem 
variables. The more involved min-unblocked heuristic finds a minimally unblocked lit- 
eral. This is more expensive to compute but does not rely on variable numbering. 

When a variable is encountered in the BFS, CNFTOOLS checks if the defining
clauses are blocked. If so, the following detection methods are applied: pattern matching
for BiEQ, AND, OR, and full patterns, monotonic checking, and semantic checking.
BiEQ refers to an equivalence between two variables.

A definition is a full pattern if $\forall C \in \delta(x) \cup \delta(\bar{x})$, $|C| = n + 1$ where $n$ is the
number of defining variables and there are $2^n$ defining clauses. The full pattern includes
some common encodings for XOR, XNOR, NOT, and Majority3, but is often avoided.
Since the detection follows the hierarchical nesting of definitions, there is no ambiguity
between the defining variables and definition variables in XOR definitions.
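
The two syntactic conditions of a full pattern are easy to state in code. The sketch below checks only those conditions as written above (clause count and clause size); it is an illustration, not the CNFTOOLS implementation, and the function name is hypothetical.

```python
def is_full_pattern(x, defining_clauses):
    """Check that there are 2**n defining clauses, each of size n + 1,
    where n is the number of defining variables."""
    defining_vars = {abs(l) for c in defining_clauses for l in c} - {x}
    n = len(defining_vars)
    return (len(defining_clauses) == 2 ** n
            and all(len(set(c)) == n + 1 for c in defining_clauses))

X, A, B = 1, 2, 3
xor_clauses = [(X, -A, B), (X, A, -B), (-X, A, B), (-X, -A, -B)]  # x <-> a XOR b
and_clauses = [(X, -A, -B), (-X, A), (-X, B)]                     # x <-> a AND b

xor_is_full = is_full_pattern(X, xor_clauses)
and_is_full = is_full_pattern(X, and_clauses)
```

The XOR encoding has four clauses of size three and passes the check, whereas the usual AND encoding has only three clauses and does not.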

The advantage of hierarchical detection is the ability to detect monotonic defini- 
tions. For variable movement we consider only monotonic definitions that are either 
fully-defined or one-sided. If a monotonic definition is not fully-defined but the defini- 
tion variable occurs positively and negatively in the defining clauses of other definitions, 
the additional clauses can prevent variable movement w.r.t. the QRAT proof system. 

Semantic checking involves solving the SAT problem for right-uniqueness described
in the preliminaries. As definitions are detected, the defining clauses are removed from
the formula for the following iterations. This can produce problematic one-sided defini- 
tions. For example, a variable may occur both positively and negatively in the defining 
clauses of other definitions, and removing those clauses makes the variable one-sided. 
Similar to the monotonic case, the additional defining clauses can prevent movement 
w.r.t. the QRAT proof system, so these types of definitions must be filtered out. 


3.2 Independent Definition Detection in KISSAT 


KISSAT uses definition detection to find candidates for BVE. Starting with the 2021 
SAT Competition, KISSAT added semantic definition detection [6] to complement the 
existing syntactic pattern matching for BiEQ, AND, OR, ITE, and XOR definitions. In 
semantic detection an internal SAT solver KITTEN with low overhead and limited capabilities
performs a right-uniqueness check on the formula $\psi(x) \cup \psi(\bar{x})$ after removing
all occurrences of $x$ and $\bar{x}$. This formula includes $\rho(x)$ and $\rho(\bar{x})$ as the set of defining
clauses is not known in advance. If the formula is unsatisfiable, an unsatisfiable core
is extracted (potentially after reduction) and returned as the set of defining clauses. 
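
The right-uniqueness check can be illustrated on small inputs without a SAT solver. The sketch below is an assumption-laden stand-in: it enumerates all assignments instead of calling KITTEN, it does not extract an unsatisfiable core, and the name `is_right_unique` is hypothetical.

```python
from itertools import product


def is_right_unique(clauses, x):
    """Brute-force version of the right-uniqueness check for variable x.

    Illustrative sketch only (exponential enumeration, no core extraction).
    Clauses are iterables of nonzero integer literals.
    """
    # Keep the clauses containing x or -x and strip those literals.
    residues = [frozenset(lit for lit in c if abs(lit) != x)
                for c in clauses if x in c or -x in c]
    variables = sorted({abs(lit) for r in residues for lit in r})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in r)
               for r in residues):
            return False  # residues are satisfiable, so x is not forced
    return True  # residues are unsatisfiable, so x is uniquely determined
```

For an AND definition the stripped clauses are jointly unsatisfiable, so the check succeeds; dropping the long clause makes them satisfiable and the check fails.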

Core extraction does not guarantee the defining clauses are blocked. Internally 
KISSAT generates resolvents over the defining clauses for BVE. We modify KISSAT to 
only detect semantic definitions where zero resolvents are generated, ensuring the defining
clauses are blocked. We ignore built-in heuristics for selecting candidate variables
and instead iterate over all variables. 

No nesting information is gathered during definition detection in KISSAT. If a vari- 
able is a part of an XOR definition, KISSAT cannot determine if the variable is a defining 
variable or the definition variable. The defining variables for an XOR may themselves 
be defined by another definition in the formula. To check for this, if a variable was 
detected as part of an XOR or semantic definition, the definition clauses were set to 
inactive and the detection procedure was rerun for that variable. 


4 Moving Variables 


After all definitions are detected, we move definition variables as close to their defining 
variables as possible to maximize universal reduction. To do this, we introduce empty 
existential quantifier levels, denoted $T_i$, following each $Q_i X_i$ in the prefix, yielding
$Q_1 X_1 \exists T_1 Q_2 X_2 \exists T_2 \cdots Q_{n-1} X_{n-1} \exists T_{n-1} Q_n X_n$. There is no $T_n$ because variables are
not moved inwards. For each definition variable $x$ that can be moved, a fresh variable
$x'$ is placed in the quantifier level $T_m$ for $m = \max\{\lambda(z) \mid z \in Z_x\}$. That is, $x'$ will be
placed in the existential block that immediately follows the innermost defining variable.
Finally, $x$ will be removed from the prefix, and the new formula will be $\psi[x'/x]$.


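Example 2. In the formula $\exists x_3 \forall x_1 \exists x_4 \forall x_2 \exists x_5.(x_5 \lor \bar{x}_4 \lor \bar{x}_3) \land (\bar{x}_5 \lor x_3) \land (\bar{x}_5 \lor x_4) \land
(x_3 \lor x_1) \land (x_2 \lor x_5)$, the variable $x_5$ is defined as $x_5 \leftrightarrow x_3 \land x_4$, with defining variables
$\{x_3, x_4\}$. A fresh variable $x_5'$ is introduced to replace $x_5$. $x_5'$ is placed in an existential
quantifier level following the innermost defining variable $x_4$. Then, $x_5'$ is substituted for
$x_5$ in the formula giving $\exists x_3 \forall x_1 \exists x_4 \exists x_5' \forall x_2.(x_5' \lor \bar{x}_4 \lor \bar{x}_3) \land (\bar{x}_5' \lor x_3) \land (\bar{x}_5' \lor x_4) \land
(x_3 \lor x_1) \land (x_2 \lor x_5')$. Finally, $x_2$ can be removed from $(x_2 \lor x_5')$ by universal reduction.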
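
The effect of the move on universal reduction can be reproduced with a few lines of code. This is a minimal sketch, assuming integer literals, a `level` map from variables to prefix positions, and the fresh variable encoded as 6; all names are hypothetical.

```python
def universal_reduction(clause, level, universal):
    """Drop universal literals that are inner to every existential literal.

    Illustrative sketch: clause is an iterable of integer literals,
    level maps each variable to its position in the prefix (outermost = 0),
    universal is the set of universally quantified variables.
    """
    max_exist = max((level[abs(lit)] for lit in clause
                     if abs(lit) not in universal), default=-1)
    return frozenset(lit for lit in clause
                     if abs(lit) not in universal or level[abs(lit)] < max_exist)


universal = {1, 2}
# Original prefix: x3, x1, x4, x2, x5 (outermost to innermost).
before = {3: 0, 1: 1, 4: 2, 2: 3, 5: 4}
# After the move: x3, x1, x4, x5' (encoded as 6), x2.
after = {3: 0, 1: 1, 4: 2, 6: 3, 2: 4}

kept = universal_reduction((2, 5), before, universal)     # x2 stays
reduced = universal_reduction((2, 6), after, universal)   # x2 is removed
```

Before the move the existential literal is inner to the universal one, so nothing is removed; after the move the universal literal is innermost and is dropped.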


Movement requires new variables because QRAT steps either add or delete clauses 
and cannot affect the quantifier placement of existing variables. When definitions are 
added in the checker QRAT-TRIM the new definition variables are placed in a quantifier 
level based on their defining variables. For a definition variable x, if the innermost 
defining variable z € X; is existentially quantified (Q; = 3) the definition variable is 
placed in X;, and if z is universally quantified (Q; = V) the definition variable is placed 
in the existential level X;}1, So, new definition variables are placed in the desired 
quantifier level. Because contiguous levels with the same quantifier can be combined, 
the introduction of T levels does not change the semantics. 


4.1 Moving in Order 


The tools for definition detection run on CNF instances, so, some definitions may not 
be left-total when considering the prefix. This can occur if the definition variable is in 
a level outer to one of its defining variables. Also, some monotonic definitions may not 
satisfy the one-sided property. These problems are checked during proof generation. If 
they occur, that variable is not moved. 

The variable movement algorithm starts at the outermost quantifier level and sweeps 
inwards, at each step moving all possible definition variables to the current level. A 
definition variable $x$ can be moved if $x >_\Pi z$ for all $z \in Z_x$, and $x$ is not universally
quantified. It can be moved to $T_m$ where $m = \max\{\lambda(z) \mid z \in Z_x\}$, and will be
moved during iteration m of the algorithm. A look up table is used to efficiently find 
definitions with the innermost defining variable at level m. Once a definition variable 
has been moved, if it was a defining variable for some other definitions, those definitions 
are checked for movement and the look up table is updated. Since the iteration starts at 
the outermost level, it guarantees variables that can be moved within our framework are 
moved as far as possible. This requires a single pass, so moved definitions will not be 
revisited. 
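
The single-pass sweep can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: prefix positions are encoded so that block $X_i$ sits at position 2i and the empty block $T_i$ at 2i + 1, and the names `sweep`, `table`, and `users` are hypothetical.

```python
from collections import defaultdict


def sweep(level, universal, definitions):
    """Move definition variables outward in one outermost-to-innermost pass.

    Illustrative sketch only.
    level:       variable -> position (X_i is 2*i, empty block T_i is 2*i + 1)
    universal:   set of universally quantified variables
    definitions: definition variable -> non-empty set of defining variables
    Returns a dict mapping each moved variable to its new position.
    """
    level = dict(level)  # work on a copy

    def target(x):
        # Existential position directly following the innermost defining variable.
        m = max(level[z] for z in definitions[x])
        return m + 1 if m % 2 == 0 else m

    table = defaultdict(set)   # look-up table: target position -> candidates
    current = {}               # current target of each definition variable
    users = defaultdict(set)   # defining variable -> definitions that use it
    for x, zs in definitions.items():
        current[x] = target(x)
        table[current[x]].add(x)
        for z in zs:
            users[z].add(x)

    moved = {}
    for t in range(1, max(level.values()) + 2, 2):
        queue = sorted(table[t])
        while queue:
            x = queue.pop()
            if x in universal or level[x] <= t:
                continue  # universal, or not strictly inner to the target
            level[x] = t
            moved[x] = t
            # Definitions that use x may now be movable to an outer position.
            for y in users[x]:
                table[current[y]].discard(y)
                current[y] = target(y)
                table[current[y]].add(y)
                if current[y] == t:
                    queue.append(y)
    return moved
```

In this encoding, moving a variable can only lower the targets of definitions that depend on it, and never below the position currently being processed, which is why a single pass suffices in the sketch.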


4.2 XOR Processing 


In an XOR definition multiple variables are left-total and right-unique. Additional infor- 
mation is required to determine which variable is the proper candidate for movement. 
If a variable is defined elsewhere and appears in an XOR, it must be a defining variable
in the XOR. In addition, universal variables must be defining variables. However, a dis- 
tinction cannot be made between the remaining variables before beginning movement. 


Example 3. Given the QBF $\exists_1 x_1, x_2 \forall y_1 \exists_2 x_3 \forall y_2 \exists_3 x_4 \forall y_3 \exists_4 x_5 \forall y_4 \exists_5 x_6, x_7.(x_6 \leftrightarrow y_1
\land x_i) \land (x_3 \oplus x_4 \oplus x_5) \land (x_1 \oplus x_5 \oplus x_6) \land \ldots$, determining the definition variables
for the XOR definitions will hinge on the movement of $x_6$. Case 1: let $x_i = x_7$ in the
AND definition; $x_6$ cannot be moved. Then, $x_5$ can be moved to $\exists_3$ as the definition
variable of $(x_3 \oplus x_4 \oplus x_5)$. No other variables can be moved. Case 2: let $x_i = x_2$ in the
AND definition; $x_6$ can be moved to $\exists_2$. Then, $x_5$ can be moved to $\exists_2$ as the definition
variable of $(x_1 \oplus x_5 \oplus x_6)$. Next, $x_4$ can be moved to $\exists_2$ as the definition variable of
$(x_3 \oplus x_4 \oplus x_5)$. The possible movement of $x_6$ will determine how the XOR definitions
are moved. This information is not known until runtime, so the definition variable of an
XOR cannot be determined before variable movement is performed.


As seen in the example, movement of definition variables can affect what variable in 
an XOR is eventually moved. The definition variable for an XOR must be determined 
during the movement process. The definition variable is initially set as the innermost 
variable in the XOR. If that variable is defined elsewhere and moved, the definition 
variable of the XOR is reset to the new innermost variable. We perform the same check 
as the general case to see if the definition variable can be moved. With XOR definitions, 
the algorithm is still deterministic and produces optimal movement, since all variables 
that can be moved are moved to their outermost level. 
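
The selection of an XOR's definition variable during movement can be sketched as a small helper that is re-evaluated whenever a variable of the XOR has been moved. This is a simplified illustration with hypothetical names; it takes the set of variables defined elsewhere as an input and uses plain integer block positions.

```python
def xor_definition_variable(xor_vars, level, universal, defined_elsewhere):
    """Return the variable currently eligible as the XOR's definition variable.

    Illustrative sketch only. The candidate is the innermost variable of the
    XOR; it is rejected if it is universal, defined by another definition,
    or not strictly inner to every other variable of the XOR.
    """
    innermost = max(xor_vars, key=lambda v: level[v])
    if innermost in universal or innermost in defined_elsewhere:
        return None
    others = [v for v in xor_vars if v != innermost]
    if any(level[v] >= level[innermost] for v in others):
        return None
    return innermost


# Block positions loosely following Example 3: x1..x7 are 1..7.
level = {1: 0, 2: 0, 3: 2, 4: 4, 5: 6, 6: 8, 7: 8}

# Case 1: x6 is defined by the AND definition and stays innermost.
case1_outer = xor_definition_variable({1, 5, 6}, level, set(), {6})
case1_inner = xor_definition_variable({3, 4, 5}, level, set(), {6})

# Case 2: x6 has been moved next to x3, then x5 follows it.
level[6] = 2
case2_first = xor_definition_variable({1, 5, 6}, level, set(), {6})
level[5] = 2
case2_second = xor_definition_variable({3, 4, 5}, level, set(), {5, 6})
```

The helper only picks the candidate; whether the candidate can actually be moved is still decided by the general check used for all definitions.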


4.3 Proving Variable Movement 


In this section we describe how to modify a formula through a series of QRAT clause 
additions and deletions to achieve variable movement. Moving a definition variable x 
in the formula $\Pi.\psi$ involves:


— Introducing a new definition variable x’ to replace x. 

— Deriving an equivalence between x’ and x. 

— Transforming the formula $\psi$ to $\psi[x'/x]$ with $x$ removed from $\Pi$ and $x'$ placed in
the existential quantifier level following its innermost defining variable.


The algorithm for moving a definition variable x proceeds in five steps, each involv- 
ing some clause additions or deletions. Some of the steps can be simplified depending 
on the type of definition. Moving a one-sided definition requires slight modifications to 
a few steps, and these are discussed following each of the relevant steps. 


1. Add the defining clauses $\delta(x')$ and $\delta(\bar{x}')$.
We introduce a fresh existential variable $x'$ and add the defining clauses $\delta(x)[x'/x]$
and $\delta(\bar{x})[x'/x]$. Each clause is Q-blocked on $x'$ or $\bar{x}'$ since the definition is left-total
and variable $x'$ is in the quantifier level following its innermost defining variable.
2. Add the equivalence clauses $x \leftrightarrow x'$.
Both $x$ and $x'$ are fully defined by the same set of variables, so it is possible to
derive the equivalence clauses $(\bar{x} \lor x')$ and $(x \lor \bar{x}')$. The first implication added
is Q-blocked-subsumed. Consider $(\bar{x} \lor x')$: for each clause $C' \in \delta(\bar{x}')$, the outer
resolvent of $C'$ with $(\bar{x} \lor x')$ on $x'$ is subsumed by the corresponding $C \in \delta(\bar{x})$. This
is not the case for $(x \lor \bar{x}')$ because the outer resolvent of $(x \lor \bar{x}')$ with $(\bar{x} \lor x')$ is
not subsumed by the formula. The clause $(x \lor \bar{x}')$ is QRAT for certain definitions,
in particular AND/OR. In the general case we generate a chain of Q-resolutions
that imply $(x \lor \bar{x}')$. We use defining variable elimination to eliminate $Z_x$ from the
formula $\delta(x) \cup \delta(\bar{x}')$. The procedure produces the clause $(x \lor \bar{x}')$. The resolution tree
rooted at $(x \lor \bar{x}')$ is traversed in post-order giving the list of clauses $C_1, \ldots, C_n, (x \lor
\bar{x}')$. We add the clauses in order, deriving $(x \lor \bar{x}')$. The clauses are subsumed by
$(x \lor \bar{x}')$ and deleted. If defining variable elimination does not produce $(x \lor \bar{x}')$,
then the definition is not right-unique. The variable $x$ cannot be moved in this case.
ONE-SIDED: assuming for the one-sided definition that $x$ occurs positively in the
defining clauses, the implication $(\bar{x}' \lor x)$ is added. The implication is Q-blocked-
subsumed for the same reasons as the first implication above. If $x$ occurs negatively
the implication $(\bar{x} \lor x')$ is added. We will continue the remaining steps under the
assumption that $x$ occurs positively in the defining clauses for the one-sided case.

3. Add and remove the remaining clauses $\rho(x)$ and $\rho(\bar{x})$.
For all clauses $C \in \rho(x)$, $C' \in \rho(x')$ is the Q-resolvent of $C$ with $(\bar{x} \lor x')$ on
pivot $x$, so $C'$ can be added. $C$ can be deleted because it is the Q-resolvent of $C'$
with $(\bar{x}' \lor x)$ on pivot $x'$. Similar reasoning is used for $C \in \rho(\bar{x})$.
ONE-SIDED: All $C' \in \rho(x')$ are added with the same reasoning as above. However,
there is no $(\bar{x} \lor x')$ so $C \in \rho(\bar{x})$ cannot be deleted until step 5.

4. Remove the equivalence clauses $x \leftrightarrow x'$.
Equivalence clauses $(x \lor \bar{x}')$, $(\bar{x} \lor x')$ are deleted. $(x \lor \bar{x}')$ is Q-blocked-subsumed
on variable $x$ since for all $D \in \delta(\bar{x})$, the outer resolvent of $(x \lor \bar{x}')$ and $D$ is
subsumed by the defining clause $D' \in \delta(\bar{x}')$, and the outer resolvent of $(x \lor \bar{x}')$
with $(\bar{x} \lor x')$ is a tautology. Similarly, $(\bar{x} \lor x')$ is Q-blocked-subsumed.
ONE-SIDED: the definition clauses need the implication in order to be deleted, and
so deletion is deferred to step 5.

5. Remove the defining clauses $\delta(x)$ and $\delta(\bar{x})$.
The defining clauses on $x$ are all Q-blocked and are deleted.
ONE-SIDED: The defining clauses $D \in \delta(x)$ can be deleted because they are Q-
resolvents of $D' \in \delta(x')$ with $(\bar{x}' \lor x)$ on $x'$. Now the clauses $(\bar{x}' \lor x)$ and $\rho(\bar{x})$
are Q-blocked on $x$ because $x$ only occurs negatively. They are deleted.
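
The order of clause additions and deletions in the five steps can be sketched for the fully-defined case. The code below only emits the sequence; it does not check any redundancy property, and it assumes that the second implication in step 2 can be added directly (as for AND/OR definitions), so the general resolution chain is omitted. The names `rename` and `movement_proof` are hypothetical.

```python
def rename(clause, x, x_new):
    """Replace variable x by x_new in a clause, preserving polarity."""
    return tuple((x_new if lit > 0 else -x_new) if abs(lit) == x else lit
                 for lit in clause)


def movement_proof(delta, rho, x, x_new):
    """Emit ('a', clause) additions and ('d', clause) deletions for moving x.

    Illustrative sketch only.
    delta: defining clauses of x (both polarities), as tuples of integers
    rho:   remaining clauses that contain x or -x
    """
    steps = []
    # Step 1: add the defining clauses on the fresh variable.
    steps += [("a", rename(c, x, x_new)) for c in delta]
    # Step 2: add the equivalence clauses (assumed directly addable here).
    equivalence = [(-x, x_new), (x, -x_new)]
    steps += [("a", c) for c in equivalence]
    # Step 3: add each renamed remaining clause, then delete the original.
    for c in rho:
        steps.append(("a", rename(c, x, x_new)))
        steps.append(("d", c))
    # Step 4: delete the equivalence clauses.
    steps += [("d", c) for c in reversed(equivalence)]
    # Step 5: delete the original defining clauses.
    steps += [("d", c) for c in delta]
    return steps


# A definition shaped like the one in Example 6, with the fresh variable as 7.
delta = [(4, 1, 2, 3), (-4, -1), (-4, -2), (-4, -3)]
rho = [(4, -5), (4, -6)]
proof = movement_proof(delta, rho, 4, 7)
```

A checker would still need to confirm each step; the sketch is meant only to make the ordering of the five steps concrete.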


Given the QBF $\Pi.\psi$, applying the transformation sequentially with definition variables
$x_1, \ldots, x_n$ will yield the QBF $\Pi'.\psi'$ where all definition variables $x_i$ have been
replaced by new variables $x_i'$ and the new variables are in the appropriate quantifier
levels. The concatenated series of clause additions and deletions generated for each
definition variable gives a QRAT proof of the equivalence between $\Pi.\psi$ and $\Pi'.\psi'$.

The steps above can also be used to move a definition variable to some existential 
quantifier between the variable and its innermost defining variable. In addition, a def- 
inition variable that is inside its defining variables can be moved further inwards by 
reversing the steps, but it is not clear when this would be useful. 


Example 4. Given the QBF $\exists x_1 \forall x_2.(x_1 \lor \bar{x}_2) \land (\bar{x}_1 \lor x_2)$, we have the definition
$x_1 \leftrightarrow x_2$. The definition is right-unique but the defining clauses are not Q-blocked on
$x_1$ since $x_1$ is at an outer quantifier level. The QBF is false but moving $x_1$ inward would
make it true. To avoid this, we only move variables outward.


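Example 5. Given the definition $x_1 \oplus x_2 \oplus x_3$ with $x_1$ as the definition variable we have
$\delta(x_1) = \{(x_1 \lor x_2 \lor x_3), (x_1 \lor \bar{x}_2 \lor \bar{x}_3)\}$ and $\delta(\bar{x}_1') = \{(\bar{x}_1' \lor \bar{x}_2 \lor x_3), (\bar{x}_1' \lor x_2 \lor \bar{x}_3)\}$.
Defining variable elimination will perform the following steps:


Eliminate $x_2$: $\{(x_1 \lor x_2 \lor x_3) \otimes_{x_2} (\bar{x}_1' \lor \bar{x}_2 \lor x_3),\ (\bar{x}_1' \lor x_2 \lor \bar{x}_3) \otimes_{x_2} (x_1 \lor \bar{x}_2 \lor \bar{x}_3)\}$
$\theta_1 = \{(x_1 \lor \bar{x}_1' \lor x_3), (x_1 \lor \bar{x}_1' \lor \bar{x}_3)\}$

Eliminate $x_3$: $\{(x_1 \lor \bar{x}_1' \lor x_3) \otimes_{x_3} (x_1 \lor \bar{x}_1' \lor \bar{x}_3)\}$
$\theta_2 = \{(x_1 \lor \bar{x}_1')\}$


The clause additions to derive the second implication in step 2 would be $(x_1 \lor \bar{x}_1' \lor
x_3)$, $(x_1 \lor \bar{x}_1' \lor \bar{x}_3)$, $(x_1 \lor \bar{x}_1')$. Each subsequent clause in the list is implied by Q-
resolution. With more defining variables, the resolution tree becomes more complex.
The derivation will be of the form $\theta_1', \ldots, \theta_{n-1}'$ for $\theta_i' \subseteq \theta_i$ where $\theta_i'$ will include only
the clauses needed to derive $(x_1 \lor \bar{x}_1')$. These can be determined by working through
the resolution chain backwards from $(x_1 \lor \bar{x}_1')$.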
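
Defining variable elimination as used in Example 5 can be sketched with plain propositional resolution. This is a minimal illustration under the assumption that tautological resolvents are discarded and that no quantifier side conditions arise; the names `resolve` and `eliminate` are hypothetical, and the fresh variable is encoded as 4.

```python
def resolve(c, d, v):
    """Resolve clause c (containing v) with clause d (containing -v)."""
    resolvent = (c - {v}) | (d - {-v})
    if any(-lit in resolvent for lit in resolvent):
        return None  # tautology, discarded
    return frozenset(resolvent)


def eliminate(clauses, v):
    """Replace all clauses on variable v by their non-tautological resolvents.

    Illustrative sketch only: clauses are frozensets of integer literals.
    """
    pos = [c for c in clauses if v in c]
    neg = [c for c in clauses if -v in c]
    result = {c for c in clauses if v not in c and -v not in c}
    for c in pos:
        for d in neg:
            r = resolve(c, d, v)
            if r is not None:
                result.add(r)
    return result


# delta(x1) and the negative defining clauses on the fresh variable (4).
start = {frozenset(c) for c in
         [(1, 2, 3), (1, -2, -3), (-4, -2, 3), (-4, 2, -3)]}
theta1 = eliminate(start, 2)
theta2 = eliminate(theta1, 3)
```

Recording which parent clauses produced each resolvent would give the resolution tree that is traversed in post-order; the sketch keeps only the resulting clause sets.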


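Example 6. Given the formula $\exists x_1 x_2 x_3 \forall x_5 x_6 \exists x_4.(x_1 \lor x_2 \lor x_3 \lor x_4) \land (\bar{x}_1 \lor \bar{x}_4) \land
(\bar{x}_2 \lor \bar{x}_4) \land (\bar{x}_3 \lor \bar{x}_4) \land (x_4 \lor \bar{x}_5) \land (x_4 \lor \bar{x}_6)$, we show the steps generating the QRAT
proof of movement for variable $x_4$ with the pivot appearing as the first literal in the
clause. Clauses following a d are deleted from the formula.


1. $(x_4' \lor x_1 \lor x_2 \lor x_3)$, $(\bar{x}_4' \lor \bar{x}_1)$, $(\bar{x}_4' \lor \bar{x}_2)$, $(\bar{x}_4' \lor \bar{x}_3)$

2. $(\bar{x}_4 \lor x_4')$, $(x_4 \lor \bar{x}_4')$

3. $(x_4' \lor \bar{x}_5)$, d$(x_4 \lor \bar{x}_5)$, $(x_4' \lor \bar{x}_6)$, d$(x_4 \lor \bar{x}_6)$

4. d$(x_4 \lor \bar{x}_4')$, d$(\bar{x}_4 \lor x_4')$

5. d$(x_4 \lor x_1 \lor x_2 \lor x_3)$, d$(\bar{x}_4 \lor \bar{x}_1)$, d$(\bar{x}_4 \lor \bar{x}_2)$, d$(\bar{x}_4 \lor \bar{x}_3)$


The definition variable $x_4$ is replaced by the fresh variable $x_4'$ which will be placed
in the prenex as $\exists x_1 x_2 x_3 \exists x_4' \forall x_5 x_6$, achieving the desired movement. The QRAT proof
system uses a stronger redundancy notion that avoids auxiliary clauses for an AND
definition in step 2.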


We verified all instances of variable movement on QBFEVAL’20 benchmarks using
QRAT-TRIM [8]. By default, QRAT-TRIM will check a satisfaction proof with forward 
checking, verifying the clause deletion steps are correct in the order they appear. A 
refutation proof is checked with backward checking, verifying the clause addition steps 
are correct starting at the empty clause and working backwards. It is not known whether 
the problem is true or false at the variable movement stage, so both clause addition and 
deletion steps are checked to preserve equivalence. To do this, we modified QRAT- 
TRIM by adding a DUAL-FORWARD mode that performs a forward check, verifying 
both clause additions and deletions. We verified several end-to-end proofs for formulas 
solved by BLOQQER after variable movement. We appended the BLOQQER proof onto 
the variable movement proof, and verified it against the original formula with QRAT- 
TRIM. All formulas that BLOQQER solved after movement were verified in this way. 


5 Evaluation


Variable movement is evaluated on 494 of the 521 QBFEVAL’20 benchmarks. Two 
benchmark families were removed due to resource limits preventing proof verification. 
We compare definition detection tools KISSAT and CNFTOOLS, then evaluate the effect
of variable movement on solver performance. We ran our experiments on StarExec [18]. 
The compute nodes that ran our experiments were Intel Xeon E5 cores with 2.4 GHz, 
and all experiments ran with 32 GB. The repository with programs and data is archived 
at https://zenodo.org/record/5733440. 


5.1 Evaluating Definition Detection 


The tools are given 10 seconds to detect definitions. KISSAT attempts to check each 
variable, whereas CNFTOOLS will iterate through root clauses until the time limit. Root 
clause selection is split into max-var (mv) and minimally-unblocked (mb). We consider 
all definitions extracted up to a timeout if one occurs. The combined approach takes the 
union of definitions found in each tool, and each tool is still allotted 10 seconds. 

Figure 1 shows the number of definitions found (top) and moved (bottom) com- 
pared to the combined approach. The tools do not go above the diagonal in either plot 
because the combined approach takes a union of found definitions and movement can- 
not be worsened by additional definitions. For many formulas multiple tools contribute 
to the combined total, shown by a column of points where none are on the diagonal. 
There is a noticeable pattern between CNFTOOLS (mb) and (mv) where (mb) performs 
slightly worse due to the additional time spent computing the minimally-unblocked root 
clauses. But there are some instances where the minimally-unblocked heuristic finds 
definitions that lead to more movement. For combined, definitions were found in 493 
instances and moved in 157 instances. In comparing the plots it is clear that the number
of definitions found is not a strict predictor of movement. KISSAT finds a similar
number of definitions as CNFTOOLS for many instances but consistently moves more. 
Table 1 shows the breakdown of definitions found and moved by type, and the AND/OR 
found more frequently by KISSAT are moved more often. 

Table 1 further illuminates the differences between the tools. CNFTOOLS has syntactic
definition detection similar to KISSAT for BiEQ, AND/OR, XOR, but fails to move
a fraction of the XOR definitions. CNFTOOLS does detect tens of XORs as monotonic 
definitions with the wrong definition variable, meaning the BFS picked up nested def- 
initions in the wrong direction w.r.t. quantifier levels. But, the reason for the large gap 
between CNFTOOLS and KISSAT is efficiency. CNFTOOLS does not detect the vast ma- 
jority of XOR definitions moved by KISSAT within the time limit, and the same is true 
for the other definitions. KISSAT uses the entire 10 seconds on 11 formulas whereas 
CNFTOOLS times out on 111 (mv) and 99 (mb). Increasing the timeout for each tool 
in the combined approach to 50 seconds produces only 780 more moved variables over 
2 formulas. It is clear from the bottom plot in Figure 1 that CNFTOOLS contributes to
the movement of the combined approach in a handful of cases where KISSAT is not on 
the diagonal. Combining the output of the tools makes use of KISSAT’s speed in detect- 
ing many simple definitions and CNFTOOLS’s ability to find one-sided definitions using 
complex heuristics and hierarchical search. 

Fig. 1. Comparison of definitions found (top) and moved (bottom) per instance between combined
and the individual tools.


No variables found by semantic detection were moved in KISSAT and only 88 were 
moved in CNFTOOLS (mb). KISSAT found 159,544 right-unique definitions with KIT- 
TEN, but only 23,457 were left-total. Of those, the majority had defining variables in the 
same level as the definition variable, and a smaller fraction had the definition variable 
at an outer level. For CNFTOOLS 48,715 (mb) and 147,170 (mv) semantic definitions 
were detected via right-uniqueness checks. These semantic definitions may not be introduced
or manipulated by users in the same way as the standard definitions, explaining
why they already occur in the desired quantifier level. 


By far the most common reason definitions cannot be moved is that they already appear
in the same quantifier level as some of their defining variables. For example, many 

Table 1. The number of definitions found and moved over all instances. Definitions moved are
broken down by a selection of the types, omitting ITE and semantic. Some one-sided definitions 
CNFTOOLS moves are fully-defined, and combined will move them based on the fully-defined 
definition provided by KISSAT. So, the missing one-sided definitions for combined are spread 
across the other definition types. 


Detection Tool      Found        Moved       BiEQ      AND/OR   One-Sided      XOR
CNFTOOLS(mv)    3,525,559    1,032,807     21,198     969,630      37,642        0
CNFTOOLS(mb)    2,856,306      935,336      4,619     891,027      39,863        0
KISSAT          9,243,158    1,567,746    308,987   1,215,036           -   42,364
combined        9,624,654    1,664,655    309,793   1,273,381      37,646   42,476


Table 2. The number of definitions found that were not left-total, split by existentially and uni- 
versally quantified variables, along with monotonic definitions that could not be moved because 
they were not one-sided. If any universally quantified variable was left-total, the formula would 
be trivially false. 


Detection Tool   Existential   Universal   One-sided
CNFTOOLS(mv)          43,278      11,360       1,107
CNFTOOLS(mb)          23,690       3,771       1,421
KISSAT                32,681       3,219           -


formulas have only two quantifier levels, so there would be no possible movement with 
all existential variables in the same level. Table 2 shows other reasons a variable may 
not be moved. A definition is not left-total when the definition variable is at a level 
outer to some of its defining variables. The tools detected several of these definitions 
on both universally and existentially quantified variables. Example 4 shows why these
variables cannot be moved inwards. Additionally, some of the monotonic definitions 
extracted by CNFTOOLS are neither fully-defined nor one-sided. These checks are not 
made until a variable becomes a candidate for movement because a large fraction will 
be preemptively filtered out due to their quantifier level placement. 

CNFTOOLS detects 2,038,407 (mv) and 1,897,482 (mb) monotonic definitions, but
this does not match the number of one-sided definitions moved. The majority of mono- 
tonic definitions found and moved are actually fully defined. This means for many of 
the definitions, either $\delta(x)$ or $\delta(\bar{x})$ can be removed from the QBF while preserving
equivalence. This can be done in QRAT by recursing through the monotonic definitions 
and deleting the redundant defining clauses. The large number of fully-defined mono- 
tonic definitions shows that QBF formulas generally do not take advantage of optimized 
encodings, such as the Plaisted-Greenbaum transformation. 


5.2 Evaluating Solvers 
We used the following solvers to evaluate the impact of variable movement. 


— RAREQS (Recursive Abstraction Refinement QBF Solver) [11] pioneered the use
of counterexample guided abstraction refinement (CEGAR)-driven recursion and
learning in QBF solvers. The 2012 version has comparable performance to current
top-tier solvers.

— CAQE (Clausal Abstraction for Quantifier Elimination) [17] is the first place winner
of the 2017, 2018, and 2020 competitions. The solver is written in RUST and based
on the CEGAR clausal abstraction algorithm.

— DEPQBF implements the adapted DPLL algorithm QDPLL, relying on dependency
schemes to select independent variables for decision making [14]. DEPQBF
incorporates QBCE [2] as inprocessing which complicates its relation to
preprocessors like BLOQQER.

— GHOSTQ is a non-clausal QBF solver [13]. The solver attempts to convert CNF
or QCIR to the GHOSTQ format which introduces Ghost variables, the dual of
Tseitin variables. The structural information gained by the conversion is important
to GHOSTQ’s performance. The conversion relies on the discovery of definitions,
which is significantly hampered by preprocessors that delete or change clauses.
GHOSTQ also supports a CEGAR extension.


Table 3 shows that variable movement never reduces the number of instances solved, with
or without BLOQQER, and increases it in most cases. Figure 2 provides a more detailed view of the QBF solvers’
performance on the original (-o) and moved (-m) formulas using the combined defi- 
nition detection. The times include definition detection and proof generation, adding 
50 seconds on average. In moved formulas, adjacent quantifier levels of the same type 
were conjoined into a single quantifier level because of GHOSTQ’s internal definition 
detection. This did not impact the other solvers. Movement significantly improves per- 
formance of CAQE, DEPQBF, and GHOSTQ-p (plain mode). GHOSTQ-ce (CEGAR 
mode) and RAREQS improve slightly with movement. Since both GHOSTQ modes 
use the same conversion to the GHOSTQ format, the impact of variable movement on 
the conversion does not explain the difference in performance. Separate experiments
moving all definitions except XORs did improve the performance of GHOSTQ in both 
modes while not affecting other solvers. This is because the conversion to the GHOSTQ 
format only checks the innermost quantifier level for XOR definitions, and cannot find 
them if they have been moved. The three solvers implementing CEGAR, GHOSTQ- 
ce, RAREQS, and CAQE, were affected differently by movement. This may be due to 
internal heuristics. 

Most state-of-the-art QBF solvers make use of preprocessors. The exception is 
GHOSTQ because its definition detection suffers after the application of QBCE. Fig- 


Table 3. The number of instances solved within the 5,000 second time limit over benchmarks where
variable movement was possible. 


Solver Original Moved BLOQQER Moved-BLOQQER 


CAQE 74 84 99 103 
GHOSTQ(p) 55 61 47 52 
GHOSTQ(ce) FI 80 65 70 
RAREQS 72 72 94 98 


DEPQBF 64 70 64 71 

Fig. 2. Cumulative number of solved instances considering only the 157 benchmarks which had
variables that could be moved.

Fig. 3. Cumulative number of solved instances after applying BLOQQER for 100 seconds considering
only the 157 benchmarks with movement.


ure 3 shows solver performance with moving variables before applying BLOQQER (m- 
b) and only applying BLOQQER (-b). The solving time includes the variable movement 
and BLOQQER runtime within a 100 second timeout. After moving variables, BLO- 
QQER solved 3 formulas and those data are reflected in the plot. In addition, each 
of the 14 formulas BLOQQER solved before movement, BLOQQER also solved after
movement. Performance improved for all solvers when applying variable movement 
before BLOQQER. One reason for this is movement may allow for more applications 
of universal reduction. We also experimented with moving variables after BLOQQER 
preprocessed the formulas. Few variables were moved, and it did not affect solver per- 
formance. This is likely due to QBCE removing defining clauses from the formula. 


6 PGBDDQ Case Study 


Two player games can be succinctly represented in QBF, as an existential player versus 
a universal opponent. Problem variables encode moves alternating between quantifier 
levels, and definition variables encode the game state as moves are played over time. 
Given a 1 x N board, the linear domino placement game has two players alternately 
placing 1 x 2 dominos on the board. The first player who cannot place a domino loses. 
The game can be encoded with around $N^2/2$ problem and $3N^2/2$ definition variables.
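
For intuition about which board sizes are wins for the second player, the game itself (not its QBF encoding) can be solved directly. The sketch below assumes the rules exactly as stated above and uses the standard Sprague-Grundy argument: placing a domino splits the free cells into two independent segments. It says nothing about the encoding or the instances used in the experiments, and the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def grundy(n):
    """Grundy value of a free segment of n cells in the linear domino game.

    Illustrative sketch only. A domino placed at cells i and i + 1 leaves
    two independent segments of lengths i and n - 2 - i.
    """
    options = {grundy(i) ^ grundy(n - 2 - i) for i in range(n - 1)}
    g = 0
    while g in options:  # minimum excluded value
        g += 1
    return g


def second_player_wins(n):
    """True if the player who moves second wins on a 1 x n board."""
    return grundy(n) == 0


second_player_boards = [n for n in range(1, 16) if second_player_wins(n)]
```

A Grundy value of zero means the player to move has no winning strategy, which corresponds to the second player winning.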

PGBDDQ is a BDD-based, proof-generating QBF solver [3]. It starts at the innermost
quantifier level and performs bucket elimination, linearizing variables and eliminating
them through a series of BDD operations that are equivalence-preserving. As
BDDs are manipulated, PGBDDQ generates a dual proof through a series of clause 
additions and deletions. PGBDDQ can solve the linear domino placement problem 
with polynomial performance when definitions are placed in carefully selected quan- 
tifier levels after their defining variables (Manual). In this configuration, moves are 
processed from the last to the first, with the BDDs at each quantifier level effectively 
encoding the outcomes of the possible end games for each board state. The performance 
deteriorates when definition variables are placed in the innermost quantifier level (End). 


[Figure: LDomino with Varying Definition Variable Placement. Series: Manual, Move, End. Horizontal axis: CPU time (0 to 5,000).]

Fig. 4. Performance on boards of size N for false formulas where player two wins. The Move 
placement times out at N = 30 and the End placement runs out of memory at N = 14. 
In this configuration, the BDDs at each quantifier level must encode the outcomes of
the possible end games in terms of the history of all moves up to that point in the game. 

Figure 4 shows the performance of PGBDDQ on false formulas where the second
player will win. In each configuration, the same hand-crafted BDD variable ordering 
was used. With the End encoding PGBDDQ runs out of memory on 32 GB RAM 
at N = 12. Applying our movement algorithm to this encoding (Move), the solver 
performs significantly better and solves all formulas up to N = 30 before timeouts 
occur. This shows how the general problem of memory inefficiency within a BDD can 
be eased by moving definition variables across quantifier levels. The gap in performance 
between the Move placement and the Manual placement may be due to the ordering 
of variables within a quantifier block or moving variables too far outward. When a 
variable is moved it can be placed anywhere within a quantifier level as this does not 
change semantics. Also, variables do not need to be moved all the way to their innermost 
defining variable. Exploring these options in the context of a structurally dependent 
solver PGBDDQ may lead to improvements that affect other QBF solvers. 


7 Conclusion and Future Work 


We presented a technique for moving definition variables in QBFs. The movement can 
be verified within the QRAT proof system, and we validated all proofs in the evaluation 
with QRAT-TRIM. Using the tools KISSAT and CNFTOOLS to detect definitions, we 
created a tool-chain for variable movement. On the QBFEVAL’20 benchmarks, roughly one
third of the formulas evaluated (157 of 494) had definitions that could be moved, and the movement increased
solver performance. In addition, we found that movement followed by BLOQQER was 
more effective than preprocessing with BLOQQER. 

For future work, incorporating quantifier level information into definition detection 
could reduce the costs. For example, the hierarchical detection could recurse outwards 
based on quantifier levels, reducing the number of root clauses explored and reducing 
the number of unmoveable definitions detected. Additionally, there are ways to expand
on variable movement. It is possible to place variables anywhere within a given quan- 
tifier level and also to adjust how far variables are moved. Optimizing movement may 
require understanding how variable movement impacts each solver’s internal heuristics 
and solving algorithm. Separately, monotonic definitions that are not one-sided present 
an interesting challenge for variable movement, as they occur in both polarities outside 
of the definition. It might also be possible to move the approximately 160,000 semantic 
definitions found by KITTEN that were right-unique but not left-total.


Abstract. In a previous paper, we have shown that clause sets belonging to the Horn
Bernays-Schonfinkel fragment over simple linear real arithmetic (HBS(SLR)) can be 
translated into HBS clause sets over a finite set of first-order constants. The translation 
preserves validity and satisfiability and it is still applicable if we extend our input with pos- 
itive universally or existentially quantified verification conditions (conjectures). We call 
this translation a Datalog hammer. The combination of its implementation in SPASS-SPL 
with the Datalog reasoner VLog establishes an effective way of deciding verification con- 
ditions in the Horn fragment. We verify supervisor code for two examples: a lane change 
assistant in a car and an electronic control unit of a supercharged combustion engine. 
In this paper, we improve our Datalog hammer in several ways: we generalize it to mixed 
real-integer arithmetic and finite first-order sorts; we extend the class of acceptable 
inequalities beyond variable bounds and positively grounded inequalities; and we 
significantly reduce the size of the hammer output by a soft typing discipline. We call 
the result the sorted Datalog hammer. It not only allows us to handle more complex 
supervisor code and to model already considered supervisor code more concisely, but it 
also improves our performance on real world benchmark examples. Finally, we replace
the previous file-based interface between SPASS-SPL and VLog by a close coupling,
resulting in a single executable binary.


1 Introduction 


Modern dynamic dependable systems (e.g., autonomous driving) continuously update software 
components to fix bugs and to introduce new features. However, the safety requirement of such 
systems demands software to be safety certified before it can be used, which is typically a lengthy 
process that hinders the dynamic update of software. We adapt the continuous certification 
approach [17] for variants of safety critical software components using a supervisor that 
guarantees important aspects through challenging, see Fig. 1. Specifically, multiple processing 
units run in parallel — certified and updated not-certified variants that produce output as 
suggestions and explications. The supervisor compares the behavior of variants and analyses 
their explications. The supervisor itself consists of a rather small set of rules that can be 
automatically verified and run by a reasoner such as SPASS-SPL. In this paper we concentrate 
on the further development of our verification approach through the sorted Datalog hammer. 


© The Author(s) 2022
Fig. 1. The supervisor architecture.


While supervisor safety conditions formalized as existentially quantified properties can 
often already be automatically verified, conjectures about invariants requiring universally 
quantified properties are a further challenge. Analogous to the Sledgehammer project [8] of 
Isabelle [31] that translates higher-order logic conjectures to first-order logic (modulo theories) 
conjectures, our sorted Datalog hammer translates first-order Horn logic modulo arithmetic
conjectures into pure Datalog programs, which are equivalent to clause sets of the Horn
Bernays-Schönfinkel fragment, called HBS.


More concretely, the underlying logic for both formalizing supervisor behavior and
formulating conjectures is the hierarchic combination of the Horn Bernays-Schönfinkel fragment
with linear arithmetic, HBS(LA), also called Superlog for Supervisor Effective Reasoning
Logics [17]. Satisfiability of BS(LA) clause sets is undecidable in general [15,23]; however,
the restriction to simple linear arithmetic, BS(SLA), yields a decidable fragment [19,22].


Inspired by the test point method for quantifier elimination in arithmetic [27] we show 
that instantiation with a finite number of values is sufficient to decide whether a universal 
or existential conjecture is a consequence of a BS(SLA) clause set. 


In this paper, we improve our Datalog hammer [11] for HBS(SLA) in three directions. 
First, we modify our Datalog hammer so it also accepts other sorts for variables besides 
reals: the integers and arbitrarily many finite first-order sorts $\mathcal{F}_i$. Each non-arithmetic sort
has a predefined finite domain corresponding to a set of constants $\mathbb{F}_i$ for $\mathcal{F}_i$ in our signature.
Second, we modify our Datalog hammer so it also accepts more general inequalities than 
simple linear arithmetic allows (but only under certain conditions). In [11], we have already 
started in this direction by extending the input logic from pure HBS(SLA) to pure positively 
grounded HBS(SLA). Here we establish a soft typing discipline by efficiently approximating 
potential values occurring at predicate argument positions of all derivable facts. Third, we 
modify the test-point scheme that is the basis of our Datalog hammer so it can exploit the
fact that not all inequalities are connected to all predicate argument positions.

Our modifications have three major advantages: first of all, they allow us to express
supervisor code for our previous use cases more elegantly and without any additional preprocessing.
Second of all, they allow us to formalize supervisor code that would have been out of scope
of the logic before. Finally, they reduce the number of required test points, which leads to
smaller transformed formulas that can be solved in much less time.

For our experiments with the test-point approach we again consider two case studies: first,
verification conditions for a supervisor taking care of multiple software variants of a lane
change assistant; second, verification conditions for a supervisor of a supercharged combustion
engine, also called an ECU for Electronic Control Unit. The supervisors in both cases are
formulated by BS(SLA) Horn clauses. Via our test point technique they are translated together 
with the verification conditions to Datalog [1] (HBS). The translation is implemented in our 
Superlog reasoner SPASS-SPL. The resulting Datalog clause set is eventually explored by the 
Datalog engine VLog [13]. This hammer constitutes a decision procedure for both universal 
and existential conjectures. The results of our experiments show that we can verify non-trivial 
existential and universal conjectures in the range of seconds while state-of-the-art solvers 
cannot solve all problems in reasonable time, see Section 4. 


Related Work: Reasoning about BS(LA) clause sets is supported by SMT (Satisfiability 
Modulo Theories) [30,29]. In general, SMT comprises the combination of a number of theories 
beyond LA such as arrays, lists, strings, or bit vectors. While SMT is a decision procedure for the 
BS(LA) ground case, universally quantified variables can be considered by instantiation [36]. 
Reasoning by instantiation does result in a refutationally complete procedure for BS(SLA), but
not in a decision procedure. The Horn fragment HBS(LA) out of BS(LA) is receiving
additional attention [20,7], because it is well-suited for software analysis and verification. Research
in this direction also goes beyond the theory of LA and considers minimal model semantics 
in addition, but is restricted to existential conjectures. Other research focuses on universal 
conjectures, but over non-arithmetic theories, e.g., invariant checking for array-based
systems [14] or considers abstract decidability criteria incomparable with the HBS(LA) class [34].
Hierarchic superposition [3] and Simple Clause Learning over Theories (SCL(T)) [12] are both 
refutationally complete for BS(LA). While SCL(T) can be immediately turned into a decision 
procedure for even larger fragments than BS(SLA) [12], hierarchic superposition needs to be 
refined to become a decision procedure already because of the Bernays-Schönfinkel part [21].
Our Datalog hammer translates HBS(SLA) clause sets with both existential and universal 
conjectures into HBS clause sets which are also subject to first-order theorem proving. Instance 
generating approaches such as iProver [25] are a decision procedure for this fragment, whereas 
superposition-based [3] first-order provers such as E [38], SPASS [40], Vampire [37], have
additional mechanisms implemented to decide HBS. In our experiments, Section 4, we will discuss
the differences between all these approaches on a number of benchmark examples in more detail.

The paper is organized as follows: after a section on preliminaries, Section 2, we present 
the theory of our sorted Datalog hammer in Section 3, followed by experiments on real world 
supervisor verification conditions, Section 4. The paper ends with a discussion of the obtained 
results and directions for future work, Section 5. The artifact (including binaries of our tools 
and all benchmark problems) is available at [9]. An extended version is available at [10] 
including proofs and pseudo-code algorithms for the presented results. 


2 Preliminaries 


We briefly recall the basic logical formalisms and notations we build upon [11]. The starting point
is a standard many-sorted first-order language for BS with constants (denoted $a,b,c$), without
non-constant function symbols, variables (denoted $w,x,y,z$), and predicates (denoted $P,Q,R$)
of some fixed arity. Terms (denoted $t,s$) are variables or constants. We write $\bar{x}$ for a vector of
variables, $\bar{a}$ for a vector of constants, and so on. An atom (denoted $A,B$) is an expression $P(\bar{t})$
for a predicate $P$ of arity $n$ and a term list $\bar{t}$ of length $n$. A positive literal is an atom $A$ and
a negative literal is a negated atom $\neg A$. We define $\mathrm{comp}(A) = \neg A$, $\mathrm{comp}(\neg A) = A$, $|A| = A$,
and $|\neg A| = A$. Literals are usually denoted $L,K,H$.

A clause is a disjunction of literals, where all variables are assumed to be universally 
quantified. C,D denote clauses, and N denotes a clause set. We write atoms(X) for the set 
of atoms in a clause or clause set X. A clause is Horn if it contains at most one positive literal, 
and a unit clause if it has exactly one literal. A clause $A_1 \lor \dots \lor A_n \lor \neg B_1 \lor \dots \lor \neg B_m$ can be
written as an implication $B_1 \land \dots \land B_m \rightarrow A_1 \lor \dots \lor A_n$, still omitting universal quantifiers. If
$Y$ is a term, formula, or a set thereof, $\mathrm{vars}(Y)$ denotes the set of all variables in $Y$, and $Y$ is
ground if $\mathrm{vars}(Y) = \emptyset$. A fact is a ground unit clause with a positive literal.
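
The notions of Horn clause, unit clause, ground clause, and fact can be made concrete with a small sketch. The encoding below (a clause as a list of `Literal` tuples, variables recognized by their first letter) is an assumption made only for illustration and is not the data structure used in SPASS-SPL.

```python
from typing import List, NamedTuple, Set, Tuple


class Literal(NamedTuple):
    positive: bool          # True for an atom A, False for a negated atom
    predicate: str
    args: Tuple[str, ...]   # terms: variables or constants


def is_variable(term: str) -> bool:
    # Assumption of this sketch: variable names start with w, x, y, or z.
    return term[:1] in {"w", "x", "y", "z"}


def variables(clause: List[Literal]) -> Set[str]:
    return {t for lit in clause for t in lit.args if is_variable(t)}


def is_horn(clause: List[Literal]) -> bool:
    # At most one positive literal.
    return sum(lit.positive for lit in clause) <= 1


def is_unit(clause: List[Literal]) -> bool:
    return len(clause) == 1


def is_ground(clause: List[Literal]) -> bool:
    return not variables(clause)


def is_fact(clause: List[Literal]) -> bool:
    # A ground unit clause with a positive literal.
    return is_unit(clause) and is_ground(clause) and clause[0].positive
```

For instance, the clause $\neg Q(x) \lor R(x,a)$ is Horn but neither unit nor ground, while $P(a,b)$ is a fact.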


Datalog and the Horn Bernays-Schönfinkel Fragment: The Horn case of the
Bernays-Schönfinkel fragment (HBS) comprises all sets of clauses with at most one positive literal. The
more general Bernays-Schönfinkel fragment (BS) in first-order logic allows arbitrary formulas
over atoms, i.e., arbitrary Boolean connectives and leading existential quantifiers. BS formulas
can be polynomially transformed into clause sets with common syntactic transformations while
preserving satisfiability and all entailments that do not refer to auxiliary constants and predicates
introduced in the transformation [32]. BS theories in our sense are also known as disjunctive
Datalog programs [16], specifically when written as implications. An HBS clause set is also called a
Datalog program. Datalog is sometimes viewed as a second-order language. We are only
interested in query answering, which can equivalently be viewed as first-order entailment or
second-order model checking [1]. Again, it is common to write clauses as implications in this case.

Two types of conjectures, i.e., formulas we want to prove as consequences of a clause set,
are of particular interest: universal conjectures $\forall \bar{x}.\phi$ and existential conjectures $\exists \bar{x}.\phi$, where $\phi$
is a BS formula that only uses variables in $\bar{x}$. We call such a conjecture positive if the formula
only uses conjunctions and disjunctions to connect atoms. Positive conjectures are the focus of
our Datalog hammer and they have the useful property that they can be transformed to one atom
over a fresh predicate symbol by adding some suitable Horn clause definitions to our clause
set $N$ [32,11]. This is also the reason why we assume for the rest of the paper that all relevant
universal conjectures have the form $\forall \bar{x}.P(\bar{x})$ and existential conjectures the form $\exists \bar{x}.P(\bar{x})$.

A substitution $\sigma$ is a function from variables to terms with a finite domain $\mathrm{dom}(\sigma) = \{x \mid
x\sigma \neq x\}$ and codomain $\mathrm{codom}(\sigma) = \{x\sigma \mid x \in \mathrm{dom}(\sigma)\}$. We denote substitutions by $\sigma,\delta,\rho$. The
application of substitutions is often written postfix, as in $x\sigma$, and is homomorphically extended
to terms, atoms, literals, clauses, and quantifier-free formulas. A substitution $\sigma$ is ground if
$\mathrm{codom}(\sigma)$ is ground. Let $Y$ denote some term, literal, clause, or clause set. $\sigma$ is a grounding
for $Y$ if $Y\sigma$ is ground, and $Y\sigma$ is a ground instance of $Y$ in this case. We denote by $\mathrm{gnd}(Y)$ the
set of all ground instances of $Y$, and by $\mathrm{gnd}_B(Y)$ the set of all ground instances over a given set
of constants $B$. The most general unifier $\mathrm{mgu}(Z_1,Z_2)$ of two terms/atoms/literals $Z_1$ and $Z_2$ is
defined as usual, and we assume that it does not introduce fresh variables and is idempotent.
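
As a minimal sketch of these definitions, the following code applies a substitution to an atom and enumerates the ground instances over a finite set of constants $B$. The atom encoding as a pair `(predicate, args)` and the naming convention for variables are assumptions of this illustration only.

```python
from itertools import product


def is_variable(term):
    # Assumption of this sketch: variables are strings starting with w, x, y, or z.
    return isinstance(term, str) and term[:1] in {"w", "x", "y", "z"}


def apply(atom, sigma):
    """Apply the substitution sigma (a dict from variables to terms), written
    postfix in the text, to an atom (predicate, args)."""
    pred, args = atom
    return (pred, tuple(sigma.get(t, t) for t in args))


def ground_instances(atom, constants):
    """All ground instances of an atom over the finite constant set B."""
    _, args = atom
    vs = sorted({t for t in args if is_variable(t)})
    instances = set()
    for values in product(constants, repeat=len(vs)):
        instances.add(apply(atom, dict(zip(vs, values))))
    return instances
```

For an atom with $k$ distinct variables and $|B|$ constants this yields $|B|^k$ instances, which is why the number of test points matters for the size of the hammer output.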

We assume a standard many-sorted first-order logic model theory, and write $\mathcal{A} \models \phi$ if an
interpretation $\mathcal{A}$ satisfies a first-order formula $\phi$. A formula $\psi$ is a logical consequence of
$\phi$, written $\phi \models \psi$, if $\mathcal{A} \models \psi$ for all $\mathcal{A}$ such that $\mathcal{A} \models \phi$. Sets of clauses are semantically treated
as conjunctions of clauses with all variables quantified universally.

BS with Linear Arithmetic: The extension of BS with linear arithmetic both over real and
integer variables, BS(LA), is the basis for the formalisms studied in this paper. We extend
the standard many-sorted first-order logic with finitely many first-order sorts $\mathcal{F}_i$ and with
two arithmetic sorts $\mathcal{R}$ for the real numbers and $\mathcal{Z}$ for the integer numbers. The sort $\mathcal{Z}$ is
a subsort of $\mathcal{R}$. Given a clause set $N$, the interpretations $\mathcal{A}$ of our sorts are fixed: $\mathcal{R}^{\mathcal{A}} = \mathbb{R}$,
$\mathcal{Z}^{\mathcal{A}} = \mathbb{Z}$, and $\mathcal{F}_i^{\mathcal{A}} = \mathbb{F}_i$, i.e., a first-order sort interpretation $\mathbb{F}_i$ consists of the set of constants
in $N$ belonging to that sort, or a single constant out of the signature if no such constant occurs.
Note that this is not a deviation from standard semantics in our context as for the arithmetic
part the canonical domain is considered and for the first-order sorts BS has the finite model
property over the occurring constants, which is sufficient for refutation-based reasoning. This
way first-order constants are distinct values.

Constant symbols, arithmetic function symbols, variables, and predicates are uniquely 
declared together with sort expressions. The unique sort of a constant symbol, variable, 
predicate, or term is denoted by the function sort(Y) and we assume all terms, atoms, and 
formulas to be well-sorted. The sort of predicate $P$'s argument position $i$ is denoted by $\mathrm{sort}(P,i)$.
For arithmetic function symbols we consider the minimal sort with respect to the subsort relation
between $\mathcal{R}$ and $\mathcal{Z}$. In the end, we do not consider arithmetic functions here, so the subsort
relationship boils down to substituting an integer sort variable or number for a real sort variable.

We assume pure input clause sets, which means the only constants of sort $\mathcal{R}$ or $\mathcal{Z}$ are
numbers. This means the only constants that we do allow are integer numbers $c \in \mathbb{Z}$ and the
constants defining our finite first-order sorts $\mathcal{F}_i$. Satisfiability of pure BS(LA) clause sets is
semi-decidable, e.g., using hierarchic superposition [3] or SCL(T) [12]. Impure BS(LA) is 
no longer compact and satisfiability becomes undecidable, but it can be made decidable when 
restricting to ground clause sets [18]. 

All arithmetic predicates and functions are interpreted in the usual way. An interpretation
of BS(LA) coincides with $\mathcal{A}^{\mathrm{LA}}$ on arithmetic predicates and functions, and freely interprets
free predicates. For pure clause sets this is well-defined [3]. Logical satisfaction and entailment
are defined as usual, and use similar notation as for BS.


Example 1. The following BS(LA) clause from our ECU case study compares the values of
engine speed (Rpm) and pressure (KPa) with entries in an ignition table (IgnTable) to derive
the basis of the current ignition value (IgnDeg1):

$$x_1 < 0 \lor x_1 \geq 13 \lor x_2 < 880 \lor x_2 \geq 1100 \lor \neg\mathrm{KPa}(x_3,x_1) \lor \neg\mathrm{Rpm}(x_4,x_2) \lor \neg\mathrm{IgnTable}(0,13,880,1100,z) \lor \mathrm{IgnDeg1}(x_3,x_4,x_1,x_2,z) \qquad (1)$$

Terms of the two arithmetic sorts are constructed from a set $\mathcal{X}$ of variables, the set of integer
constants $c \in \mathbb{Z}$, and binary function symbols $+$ and $-$ (written infix). Atoms in BS(LA) are
either first-order atoms (e.g., $\mathrm{IgnTable}(0,13,880,1100,z)$) or (linear) arithmetic atoms (e.g., $x_2 < 880$).
Arithmetic atoms may use the predicates $\leq, <, \neq, =, >, \geq$, which are written infix and have
the expected fixed interpretation. Predicates used in first-order atoms are called free. First-order
literals and related notation are defined as before. Arithmetic literals coincide with arithmetic
atoms, since the arithmetic predicates are closed under negation, e.g., $\neg(x_2 \geq 1100) = x_2 < 1100$.
BS(LA) clauses and conjectures are defined as for BS but using BS(LA) atoms. We often
write Horn clauses in the form $\Lambda \parallel \Delta \rightarrow H$ where $\Delta$ is a multiset of free first-order atoms, $H$
is either a first-order atom or $\bot$, and $\Lambda$ is a multiset of LA atoms. The semantics of a clause in
the form $\Lambda \parallel \Delta \rightarrow H$ is $\bigvee_{\lambda \in \Lambda} \neg\lambda \lor \bigvee_{A \in \Delta} \neg A \lor H$, e.g., the clause $x > 1 \lor y \neq 5 \lor \neg Q(x) \lor R(x,y)$
is also written $x \leq 1, y = 5 \parallel Q(x) \rightarrow R(x,y)$.

A clause or clause set is abstracted if its first-order literals contain only variables or
first-order constants. Every clause $C$ is equivalent to an abstracted clause that is obtained
by replacing each non-variable arithmetic term $t$ that occurs in a first-order atom by a fresh
variable $x$ while adding an arithmetic atom $x \neq t$ to $C$. We assume abstracted clauses for theory
development, but we prefer non-abstracted clauses in examples for readability, e.g., a fact
$P(3,5)$ is considered in the development of the theory as the clause $x = 3, y = 5 \parallel \rightarrow P(x,y)$. This
is important when collecting the necessary test points. Moreover, we assume that all variables
in the theory part of a clause also appear in the first-order part, i.e., $\mathrm{vars}(\Lambda) \subseteq \mathrm{vars}(\Delta \rightarrow H)$
for every clause $\Lambda \parallel \Delta \rightarrow H$. If this is not the case for $x$ in $\Lambda \parallel \Delta \rightarrow H$, then we can easily
fix this by first introducing a fresh unary predicate $Q$ over the sort $\mathrm{sort}(x)$, then adding the literal
$Q(x)$ to $\Delta$, and finally adding a clause $\parallel \rightarrow Q(x)$ to our clause set. Alternatively, $x$ could be
eliminated by LA variable elimination in our context; however, this results in a worst-case
exponential blow-up in size. This restriction is necessary because we base all our computations
for the test-point scheme on predicate argument positions and would not get any test points
for variables that are not connected to any predicate argument positions.
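
The abstraction step for a single atom can be sketched as follows. The sketch produces the constraint part in the $\Lambda \parallel \Delta \rightarrow H$ notation, so the added disjunct $x \neq t$ appears as the equation $x = t$ on the left of $\parallel$. The atom encoding and the fresh variable names `v1`, `v2`, ... are hypothetical choices for this illustration.

```python
from itertools import count


def fresh_variables():
    """Hypothetical supply of fresh variable names v1, v2, ..."""
    return (f"v{i}" for i in count(1))


def abstract_atom(pred, args, fresh):
    """Replace every number in a first-order atom by a fresh variable and
    return the equations that go into the arithmetic part of the clause."""
    new_args, equations = [], []
    for t in args:
        if isinstance(t, (int, float)):
            v = next(fresh)
            new_args.append(v)
            equations.append((v, "=", t))
        else:
            # Variables and first-order constants stay untouched.
            new_args.append(t)
    return equations, (pred, tuple(new_args))
```

Applied to the fact $P(3,5)$, this returns the two equations for the fresh variables together with the abstracted atom, matching the clause $x = 3, y = 5 \parallel \rightarrow P(x,y)$ up to variable names.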


Simpler Forms of Linear Arithmetic: The main logic studied in this paper is obtained by
restricting HBS(LA) to a simpler form of linear arithmetic. We first introduce a simpler logic
HBS(SLA) as a well-known fragment of HBS(LA) for which satisfiability is decidable [19,22],
and later present the generalization HBS(SLA)PA of this formalism that we will use.


Definition 2. The Horn Bernays-Schönfinkel fragment over simple linear arithmetic, HBS(SLA),
is a subset of HBS(LA) where all arithmetic atoms are of the form $x \triangleleft c$ or $d \triangleleft c$, such that
$c \in \mathbb{Z}$, $d$ is a (possibly free) constant, $x \in \mathcal{X}$, and $\triangleleft \in \{\leq, <, \neq, =, >, \geq\}$.


Please note that HBS(SLA) clause sets may be impure due to free first-order constants
of an arithmetic sort. Studying impure fragments is beyond the scope of this paper but they
show up in applications as well.


Example 3. The ECU use case leads to HBS(LA) clauses such as 


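$$x_1 < y_1 \lor x_1 \geq y_2 \lor x_2 < y_3 \lor x_2 \geq y_4 \lor \neg\mathrm{KPa}(x_3,x_1) \lor \neg\mathrm{Rpm}(x_4,x_2) \lor \neg\mathrm{IgnTable}(y_1,y_2,y_3,y_4,z) \lor \mathrm{IgnDeg1}(x_3,x_4,x_1,x_2,z) \qquad (2)$$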


This clause is not in HBS(SLA), e.g., since $x_1 \geq y_2$ is not allowed in BS(SLA). However, clause
(1) of Example 1 is a BS(SLA) clause that is an instance of (2), obtained by the substitution
$\{y_1 \mapsto 0, y_2 \mapsto 13, y_3 \mapsto 880, y_4 \mapsto 1100\}$. This grounding will eventually be obtained by
resolution on the IgnTable predicate, because it occurs only positively in ground unit facts.


Example 3 shows that HBS(SLA) clauses can sometimes be obtained by instantiation. In
fact, for the satisfiability of an HBS(LA) clause set $N$ only those instances of clauses
$(\Lambda \parallel \Delta \rightarrow H)\sigma$ are relevant, for which we can actually derive all ground facts $A \in \Delta\sigma$ by resolution from
$N$. If $A$ cannot be derived from $N$ and $N$ is satisfiable, then there always exists a satisfying
interpretation $\mathcal{A}$ that interprets $A$ as false (and thus $(\Lambda \parallel \Delta \rightarrow H)\sigma$ as true). Moreover, if those
relevant instances can be simplified to HBS(SLA) clauses, then it is possible to extend almost
all HBS(SLA) techniques (including our Datalog hammer) to those HBS(LA) clause sets.
In our case resolution means hierarchic unit resolution: given a clause $\Lambda_1 \parallel L, \Delta \rightarrow H$ and
a unit clause $\Lambda_2 \parallel \rightarrow K$ with $\sigma = \mathrm{mgu}(L,K)$, their hierarchic resolvent is $(\Lambda_1, \Lambda_2 \parallel \Delta \rightarrow H)\sigma$.
A fact $P(\bar{a})$ is derivable from a pure set of HBS(LA) clauses $N$ if there exists a clause
$\Lambda \parallel \rightarrow P(\bar{t})$ that (i) is the result of a sequence of unit resolution steps from the clauses in $N$ and
(ii) has a grounding $\sigma$ such that $P(\bar{t})\sigma = P(\bar{a})$ and $\Lambda\sigma$ evaluates to true. If $N$ is satisfiable, then
this means that any fact $P(\bar{a})$ derivable from $N$ is true in all satisfying interpretations of $N$,
i.e., $N \models P(\bar{a})$. We denote the set of derivable facts for a predicate $P$ from $N$ by $\mathrm{dfacts}(P,N)$.
A refutation is a sequence of resolution steps that produces a clause $\Lambda \parallel \rightarrow \bot$ with $\mathcal{A}^{\mathrm{LA}} \models \Lambda\delta$
for some grounding $\delta$. Hierarchic unit resolution is sound and refutationally complete for
pure HBS(LA), since every set $N$ of pure HBS(LA) clauses is sufficiently complete [3],
and hence hierarchic superposition is sound and refutationally complete for $N$ [3,6].

So naturally if all derivable facts of a predicate P already appear in N, then only those 
instances of clauses can be relevant whose occurrences of P match those facts (i.e., can be 
resolved with them). We call predicates with this property positively grounded: 


Definition 4 (Positively Grounded Predicate [11]). Let N be a set of HBS(LA) clauses. A 
free first-order predicate P is a positively grounded predicate in N if all positive occurrences 
of P in N are in ground unit clauses (also called facts). 


Definition 5 (Positively Grounded HBS(SLA): HBS(SLA)P [11]). An HBS(LA) clause
set $N$ belongs to the fragment positively grounded HBS(SLA) (HBS(SLA)P) if we can
transform $N$ into an HBS(SLA) clause set $N'$ by first resolving away all negative occurrences
of positively grounded predicates $P$ in $N$, simplifying the thus instantiated LA atoms, and
finally eliminating all clauses where those predicates occur negatively.


As mentioned before, if all relevant instances of an HBS(LA) clause set can be simplified
to HBS(SLA) clauses, then it is possible to extend almost all HBS(SLA) techniques (including
our Datalog hammer) to those clause sets. HBS(SLA)P clause sets have this property and this
is the reason why we managed to extend our Datalog hammer to pure HBS(SLA)P clause
sets in [11]. For instance, the set $N = \{P(1), P(2), Q(0), (x < y+z \parallel P(y),Q(z) \rightarrow R(x,y))\}$ is
an HBS(LA) clause set, but not an HBS(SLA) clause set due to the inequality $x < y+z$. Note,
however, that the predicates $P$ and $Q$ are positively grounded: the only positive occurrences of $P$
and $Q$ are the facts $P(1)$, $P(2)$, and $Q(0)$. If we resolve with the facts for $P$ and $Q$ and simplify,
then we get the clause set $N' = \{P(1), P(2), Q(0), (x < 1 \parallel \rightarrow R(x,1)), (x < 2 \parallel \rightarrow R(x,2))\}$,
which does now belong to HBS(SLA). This means $N$ is a positively grounded HBS(SLA)
clause set and our Datalog hammer can still handle it.
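
The reduction from $N$ to $N'$ in this example can be replayed with a short sketch. The dictionary encoding of the rule, the restriction to a single constraint of the form "variable < sum of variables", and the assumption that body atoms share no variables are simplifications made only for this illustration; they are not the representation used in SPASS-SPL.

```python
from itertools import product

# Facts of the example: P(1), P(2), Q(0).
facts = {"P": [(1,), (2,)], "Q": [(0,)]}

# Rule x < y + z || P(y), Q(z) -> R(x, y), in a hypothetical encoding.
rule = {
    "constraint": ("x", "<", ("y", "z")),   # x < y + z
    "body": [("P", ("y",)), ("Q", ("z",))],
    "head": ("R", ("x", "y")),
}


def positively_grounded(pred, rules, facts):
    """True if pred occurs positively only in the given ground facts."""
    return pred in facts and all(r["head"][0] != pred for r in rules)


def resolve_away(rule, facts):
    """Resolve all body atoms with the matching facts and simplify the
    instantiated constraint to a simple variable bound."""
    results = []
    choices = [facts[pred] for pred, _ in rule["body"]]
    for combo in product(*choices):
        sigma = {}
        for (_, body_vars), values in zip(rule["body"], combo):
            sigma.update(zip(body_vars, values))
        var, op, summands = rule["constraint"]
        bound = sum(sigma[s] for s in summands)
        head_pred, head_args = rule["head"]
        head = (head_pred, tuple(sigma.get(a, a) for a in head_args))
        results.append(((var, op, bound), head))
    return results
```

Calling `resolve_away(rule, facts)` yields the two instances with bounds $x < 1$ and $x < 2$ and heads $R(x,1)$ and $R(x,2)$, i.e., the non-fact clauses of $N'$.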

Positively grounded predicates are only one way to filter out irrelevant clause instances. As 
part of our improvements, we define in Section 3 a new logic called approximately grounded 
HBS(SLA) (HBS(SLA)PA) that is an extension of HBS(SLA)P and serves as the new input 
logic of our sorted Datalog hammer. 


Test-Point Schemes and Functions: The Datalog hammer in [11] is based on the following
idea: For any pure HBS(SLA) clause set $N$ that is unsatisfiable, we only need to look
at the instances $\mathrm{gnd}_B(N)$ of $N$ over finitely many test points $B$ to construct a refutation.
Symmetrically, if $N$ is satisfiable, then we can extrapolate a satisfying interpretation for $N$
from a satisfying interpretation for $\mathrm{gnd}_B(N)$. If we can compute such a set of test points $B$ for
a clause set $N$, then we can transform the clause set into an equisatisfiable Datalog program.
There exist similar properties for universal/existential conjectures. A test-point scheme is an
algorithm that can compute such a set of test points $B$ for any HBS(SLA) clause set $N$ and
any conjecture $N \models Q\bar{x}.P(\bar{x})$ with $Q \in \{\exists,\forall\}$.

The test-point scheme used by our original Datalog hammer computes the same set of 
test points for all variables and predicate argument positions. This has several disadvantages: 
(i) it cannot handle variables with different sorts and (ii) it often selects too many test points 
(per argument position) because it cannot recognize which inequalities and which argument 
positions are connected. The goal of this paper is to resolve these issues. However, this also 
means that we have to assign different test-point sets to different predicate argument positions. 
We do this with so-called test-point functions. 


A test-point function (tp-function) $\beta$ is a function that assigns to some argument positions
$i$ of some predicates $P$ a set of test points $\beta(P,i)$. An argument position $(P,i)$ is assigned a
set of test points if $\beta(P,i) \subseteq \mathrm{sort}(P,i)^{\mathcal{A}}$ and otherwise $\beta(P,i) = \bot$. A test-point function $\beta$ is
total if all argument positions $(P,i)$ are assigned, i.e., $\beta(P,i) \neq \bot$.

A variable $x$ of a clause $\Lambda \parallel \Delta \rightarrow H$ occurs in an argument position $(P,i)$ if $(P,i) \in
\mathrm{depend}(x, \Lambda \parallel \Delta \rightarrow H)$, where $\mathrm{depend}(x,Y) = \{(P,i) \mid P(\bar{t}) \in \mathrm{atoms}(Y) \text{ and } t_i = x\}$. Similarly, a
variable $x$ of an atom $Q(\bar{t})$ occurs in an argument position $(Q,i)$ if $(Q,i) \in \mathrm{depend}(x,Q(\bar{t}))$. A
substitution $\sigma$ for a clause $Y$ or atom $Y$ is a well-typed instance over a tp-function $\beta$ if it
guarantees for each variable $x$ that $x\sigma$ is an element of $\mathrm{sort}(x)^{\mathcal{A}}$ and part of every test-point set (i.e.,
$x\sigma \in \beta(P,i)$) of every argument position $(P,i)$ it occurs in (i.e., $(P,i) \in \mathrm{depend}(x,Y)$) and that
is assigned a test-point set by $\beta$ (i.e., $\beta(P,i) \neq \bot$). To abbreviate this, we define a set $\mathrm{wti}(x,Y,\beta)$
that contains all values with which a variable can fulfill the above condition, i.e.,
$$\mathrm{wti}(x,Y,\beta) = \mathrm{sort}(x)^{\mathcal{A}} \cap \Big(\bigcap_{(P,i) \in \mathrm{depend}(x,Y) \text{ and } \beta(P,i) \neq \bot} \beta(P,i)\Big).$$
Following this definition, we denote by $\mathrm{wtis}_\beta(Y)$ the set of all well-typed instances for a
clause/atom $Y$ over the tp-function $\beta$, or formally: $\mathrm{wtis}_\beta(Y) = \{\sigma \mid \forall x \in \mathrm{vars}(Y).\ (x\sigma) \in \mathrm{wti}(x,Y,\beta)\}$.
With the function $\mathrm{gnd}_\beta$, we denote the set of all well-typed ground instances of a clause/atom $Y$
over the tp-function $\beta$, i.e., $\mathrm{gnd}_\beta(Y) = \{Y\sigma \mid \sigma \in \mathrm{wtis}_\beta(Y)\}$, or of a set of clauses $N$, i.e.,
$\mathrm{gnd}_\beta(N) = \{Y\sigma \mid Y \in N \text{ and } \sigma \in \mathrm{wtis}_\beta(Y)\}$.
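
The definitions of depend, wti, and the well-typed ground instances can be sketched in a few lines. The sketch below assumes a finite stand-in `sort_domain` for $\mathrm{sort}(x)^{\mathcal{A}}$ (the real sorts are infinite), represents an unassigned position ($\bot$) by a missing dictionary key, and treats every string argument as a variable; the concrete test points are hypothetical and not computed by the test-point scheme of the paper.

```python
from itertools import product

# Hypothetical tp-function: (predicate, position) -> finite set of test points.
beta = {
    ("Speed", 1): {0, 2000, 4000},
    ("IgnDeg", 1): {0, 2000},
    ("IgnDeg", 2): {1350, 1600},
}


def depend(x, atoms):
    """Argument positions (P, i), counted from 1, in which variable x occurs."""
    return {(p, i) for p, args in atoms
            for i, t in enumerate(args, start=1) if t == x}


def wti(x, atoms, beta, sort_domain):
    """Values allowed for x: the sort domain intersected with the test-point
    sets of all assigned argument positions x occurs in."""
    values = set(sort_domain)
    for pos in depend(x, atoms):
        if pos in beta:          # positions without an entry are unassigned
            values &= beta[pos]
    return values


def well_typed_ground_instances(atoms, beta, sort_domain):
    """All well-typed ground instances of a list of atoms over beta."""
    vs = sorted({t for _, args in atoms for t in args if isinstance(t, str)})
    domains = [sorted(wti(x, atoms, beta, sort_domain)) for x in vs]
    instances = []
    for values in product(*domains):
        sigma = dict(zip(vs, values))
        instances.append([(p, tuple(sigma.get(t, t) for t in args))
                          for p, args in atoms])
    return instances
```

For the atoms `Speed(xp)` and `IgnDeg(xp, y)`, the variable `xp` is restricted to the intersection of the test points of both positions it occurs in, so fewer instances are generated than with one global test-point set.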


The most general tp-function, denoted by $\beta^*$, assigns each argument position to the
interpretation of its sort, i.e., $\beta^*(P,i) = \mathrm{sort}(P,i)^{\mathcal{A}}$. So depending on the sort of $(P,i)$, either
to $\mathbb{R}$, $\mathbb{Z}$, or one of the $\mathbb{F}_i$. A set of clauses $N$ is satisfiable if and only if $\mathrm{gnd}_{\beta^*}(N)$, the set
of all ground instances of $N$ over the base sorts, is satisfiable. Since $\beta^*$ is the most general
tp-function, we also write $\mathrm{gnd}(Y)$ for $\mathrm{gnd}_{\beta^*}(Y)$ and $\mathrm{wtis}(Y)$ for $\mathrm{wtis}_{\beta^*}(Y)$.

If we restrict ourselves to test points, then we also only get interpretations over test points
and not for the full base sorts. In order to extrapolate an interpretation from test points to
their full sorts, we define extrapolation functions (ep-functions) $\eta$. An extrapolation function
(ep-function) $\eta(P,\bar{a})$ maps an argument vector of test points for predicate $P$ (with $a_i \in \beta(P,i)$)
to the subset of $\mathrm{sort}(P,1)^{\mathcal{A}} \times \dots \times \mathrm{sort}(P,n)^{\mathcal{A}}$ that is supposed to be interpreted the same as
$\bar{a}$, i.e., $P(\bar{a})$ is interpreted as true if and only if $P(\bar{b})$ with $\bar{b} \in \eta(P,\bar{a})$ is interpreted as true.
By default, any argument vector of test points $\bar{a}$ for $P$ must also be an element of $\eta(P,\bar{a})$, i.e.,
$\bar{a} \in \eta(P,\bar{a})$. An extrapolation function does not have to be complete for all argument positions,
i.e., there may exist argument positions from which we cannot extrapolate to all argument
vectors. Formally this means that the actual set of values that can be extrapolated from $(P,i)$
(i.e., $\bigcup_{a_1 \in \beta(P,1)} \dots \bigcup_{a_n \in \beta(P,n)} \eta(P,\bar{a})$) may be a strict subset of $\mathrm{sort}(P,1)^{\mathcal{A}} \times \dots \times \mathrm{sort}(P,n)^{\mathcal{A}}$.
For all other values $\bar{a}$, $P(\bar{a})$ is supposed to be interpreted as false.
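
To make the idea of extrapolation concrete, the following sketch uses a purely hypothetical ep-function for a unary predicate: each test point represents the half-open interval up to the next test point. The test points, the upper bound, and the interval shape are assumptions chosen for illustration and are not the ep-functions constructed by the sorted Datalog hammer.

```python
import bisect

# Hypothetical test points for the single argument position of a unary predicate.
TEST_POINTS = [0, 2000, 4000, 6000]
UPPER = 8000  # hypothetical end of the last interval


def eta(a):
    """Extrapolation set of test point a: the interval [a, next test point)."""
    i = TEST_POINTS.index(a)
    hi = TEST_POINTS[i + 1] if i + 1 < len(TEST_POINTS) else UPPER
    return (a, hi)


def representative(value):
    """Test point whose extrapolation set contains value, or None."""
    if not (TEST_POINTS[0] <= value < UPPER):
        return None
    return TEST_POINTS[bisect.bisect_right(TEST_POINTS, value) - 1]


def extrapolate(interpretation, value):
    """Truth value of P(value), given an interpretation over test points only.
    Values outside every extrapolation set are interpreted as false."""
    a = representative(value)
    return a is not None and interpretation.get(a, False)
```

Every test point lies in its own extrapolation set, and this ep-function is not complete: values below 0 or from 8000 upwards are covered by no test point and are therefore interpreted as false.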
Covering Clause Sets and Conjectures: Our goal is to create total tp-functions that restrict
our solution space from the infinite reals and integers to finite sets of test points while 
still preserving (un)satisfiability. Based on these tp-functions, we are then able to define a 
Datalog hammer that transforms a clause set belonging to (an extension of) HBS(LA) into 
an equisatisfiable HBS clause set; even modulo universal and existential conjectures. 

To be more precise, we are interested in finite tp-functions (together with matching
ep-functions) that cover a clause set $N$ or a conjecture $N \models Q\bar{x}.P(\bar{x})$ with $Q \in \{\exists,\forall\}$. A total
tp-function $\beta$ is finite if each argument position is assigned to a finite set of test points, i.e.,
$|\beta(P,i)| \in \mathbb{N}$. A tp-function $\beta$ covers a set of clauses $N$ if $\mathrm{gnd}_\beta(N)$ is equisatisfiable to $N$. A
tp-function $\beta$ covers a universal conjecture $\forall \bar{x}.Q(\bar{x})$ over $N$ if $\mathrm{gnd}_\beta(N) \cup N_Q$ is satisfiable if
and only if $N \models \forall \bar{x}.Q(\bar{x})$ is false. Here $N_Q$ is the set $\{\parallel \mathrm{gnd}_\beta(Q(\bar{x})) \rightarrow \bot\}$ if $\eta$ is complete for
$Q$ or the empty set otherwise. A tp-function $\beta$ covers an existential conjecture $N \models \exists \bar{x}.Q(\bar{x})$
if $\mathrm{gnd}_\beta(N) \cup \mathrm{gnd}_\beta(\parallel Q(\bar{x}) \rightarrow \bot)$ is satisfiable if and only if $N \models \exists \bar{x}.Q(\bar{x})$ is false.

The most general tp-function $\beta^*$ obviously covers all HBS(LA) clause sets and conjectures
because satisfiability of $N$ is defined over $\mathrm{gnd}_{\beta^*}(N)$. However, $\beta^*$ is not finite. The test-point
scheme in [11], which assigns one finite set of test points $B$ to all variables, also covers clause
sets and universal/existential conjectures; at least if we restrict our input to variables over
the reals. As mentioned before, the goal of this paper is to improve this test-point scheme by
assigning different test-point sets to different predicate argument positions.


3 The Sorted Datalog Hammer 


In this section, we present a transformation that we call the sorted Datalog hammer. It 
transforms any pure HBS(SLA) clause set modulo a conjecture into an HBS clause set. To 
guide our explanations, we apply each step of the transformation to a simplified example of 
the electronic control unit use case: 


Example 6. An electronic control unit (ECU) of a combustion engine determines actuator 
operations. For instance, it computes the ignition timings based on a set of input sensors. To 
this end, it looks up some base factors from static tables and combines them to the actual 
actuator values through a series of rules. 

In our simplified model of an ECU, we only compute one actuator value, the
ignition timing, and we only have an engine speed sensor (measuring in Rpm) as our
input sensor. Our verification goal, expressed as a universal conjecture, is to confirm
that the ECU computes an ignition timing for all potential input sensor values.
Determining completeness of a set of rules, i.e., determining that the rules produce a
result for all potential input values, is also our most common application for universal
conjectures. The ECU model is encoded as the following pure HBS(LA) clause set $N$:

$D_1: \mathrm{SpeedTable}(0,2000,1350)$, $D_2: \mathrm{SpeedTable}(2000,4000,1600)$,
$D_3: \mathrm{SpeedTable}(4000,6000,1850)$, $D_4: \mathrm{SpeedTable}(6000,8000,2100)$,
$C_1: 0 \le x_p,\ x_p < 8000 \;\|\; \to \mathrm{Speed}(x_p)$,
$C_2: x_1 \le x_p,\ x_p < x_2 \;\|\; \mathrm{Speed}(x_p), \mathrm{SpeedTable}(x_1,x_2,y) \to \mathrm{IgnDeg}(x_p,y)$,
$C_3: \mathrm{IgnDeg}(x_p,z) \to \mathrm{ResArgs}(x_p)$, $C_4: \mathrm{ResArgs}(x_p) \to \mathrm{Conj}(x_p)$,
$C_5: x_p \ge 8000 \;\|\; \to \mathrm{Conj}(x_p)$, $C_6: x_p < 0 \;\|\; \to \mathrm{Conj}(x_p)$

In this example all variables are real variables. The clauses $D_1$ to $D_4$ are table entries from
which we determine the base factor of our ignition time based on the speed. Semantically,
$D_1: \mathrm{SpeedTable}(0,2000,1350)$ states that the base ignition time is 13.5° before dead center if
the engine speed lies between 0 Rpm and 2000 Rpm. The clause $C_1$ produces all possible input
sensor values labeled by the predicate Speed. The clause $C_2$ determines the ignition timing from
the current speed and the table entries. The end result is stored in the predicate $\mathrm{IgnDeg}(x_p,z)$,
where $z$ is the resulting ignition timing and $x_p$ is the speed that led to this result. The clauses
$C_3$ to $C_6$ are necessary for encoding the verification goal as a universal conjecture over a
single atom. In clause $C_3$, the return value is removed from the result predicate $\mathrm{IgnDeg}(x_p,z)$
because for the conjecture we only need to know that there is a result and not what the result is.
Clause $C_4$ guarantees that the conjecture predicate $\mathrm{Conj}(x_p)$ is true if the rules can produce an
$\mathrm{IgnDeg}(x_p,z)$ for the sensor value. Clauses $C_5$ and $C_6$ guarantee that the conjecture predicate is
true if the sensor value is out of bounds. This flattening process can be done automatically
using the techniques outlined in [11]. Hence, the ECU computes an ignition timing for all
potential input sensor values if the universal conjecture $\forall x_p.\,\mathrm{Conj}(x_p)$ is entailed by $N$.
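To make the intended reading of the rules concrete, the clause set can be mimicked procedurally. The following is only a minimal Python sketch: the names `SPEED_TABLE`, `ign_deg`, `conj`, and `samples` are hypothetical, and checking a handful of sampled speeds merely illustrates the conjecture; it does not prove it for all real values.

```python
# Table entries D1-D4: (lower bound, upper bound, base ignition timing).
SPEED_TABLE = [
    (0, 2000, 1350),
    (2000, 4000, 1600),
    (4000, 6000, 1850),
    (6000, 8000, 2100),
]

def ign_deg(speed):
    """Clauses C1 and C2: all ignition timings derivable for a speed."""
    if not (0 <= speed < 8000):      # C1 derives no Speed fact here
        return []
    return [y for (lo, hi, y) in SPEED_TABLE if lo <= speed < hi]

def conj(speed):
    """Clauses C3-C6: a result exists or the speed is out of bounds."""
    return bool(ign_deg(speed)) or speed >= 8000 or speed < 0

# A few sample sensor values (hypothetical choice, not the test points
# constructed later in the text).
samples = [-1, 0, 1999.5, 2000, 5999.9, 7999.9, 8000, 12000]
print(all(conj(s) for s in samples))
```

Removing a row from `SPEED_TABLE` makes `conj` fail for speeds in the corresponding range, which is exactly the kind of incompleteness the universal conjecture is meant to detect.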


Approximately Grounded. Example 6 contains inequalities that go beyond simple variable
bounds, e.g., $x_1 \le x_p$ in $C_2$. However, it is possible to reduce the example to an HBS(SLA)
clause set. As our first step of the sorted Datalog hammer, we explain a way to heuristically
determine which HBS(LA) clause sets can be reduced to HBS(SLA) clause sets. Moreover,
we show later that we do not have to explicitly perform this reduction but that we can extend
our other algorithms to handle this heuristic extension of HBS(SLA) directly.


We start by formulating an extension of positively grounded HBS(SLA) called approximately
grounded HBS(SLA). It is based on over-approximating the set of derivable values
$\mathrm{dvals}(P,i,N) = \{a_i \mid P(\bar{a}) \in \mathrm{dfacts}(P,N)\}$ for each argument position $i$ of each predicate $P$ in $N$
with only finitely many derivable values, i.e., $|\mathrm{dvals}(P,i,N)| \in \mathbb{N}$. These argument positions are
also called finite. Naturally, all argument positions over first-order sorts $\mathcal{F}$ are finite argument
positions. With regard to clause relevance, only those clause instances are relevant where a
finite argument position is instantiated by one of the derivable values. We call a set of clauses $N$
an approximately grounded HBS(SLA) clause set if all relevant instances based on this criterion
can be simplified to HBS(SLA) clauses. For instance, the set
$N = \{(x < 1 \;\|\; \to P(x,1)),\ (x > 2 \;\|\; \to P(x,3)),\ (x \ge 0 \;\|\; \to Q(x,0)),\ (u < y+z \;\|\; P(x,y), Q(x,z) \to R(x,y,z,u))\}$
is an HBS(LA) clause set, but not a (positively grounded) HBS(SLA) clause set due to the
inequality $u < y+z$ and the lack of positively grounded predicates. However, the argument
positions $(P,2)$, $(Q,2)$, $(R,2)$ and $(R,3)$ only have finitely many derivable values:
$\mathrm{dvals}(P,2,N) = \mathrm{dvals}(R,2,N) = \{1,3\}$ and $\mathrm{dvals}(Q,2,N) = \mathrm{dvals}(R,3,N) = \{0\}$.
If we instantiate all occurrences of $P$ and $Q$ over those values, then we get the set
$N' = \{(x < 1 \;\|\; \to P(x,1)),\ (x > 2 \;\|\; \to P(x,3)),\ (x \ge 0 \;\|\; \to Q(x,0)),\ (u < 1 \;\|\; P(x,1), Q(x,0) \to R(x,1,0,u)),\ (u < 3 \;\|\; P(x,3), Q(x,0) \to R(x,3,0,u))\}$,
which is an HBS(SLA) clause set. This means $N$ is an approximately grounded HBS(SLA) clause
set and our extended Datalog hammer can handle it.


Determining the finiteness of a predicate argument position (and all its derivable values)
is not trivial. In general, it is as hard as determining the satisfiability of a clause set [10], and
thus undecidable in the case of HBS(LA) [15,23]. This is why we only over-approximate
the derivable values with the following algorithm.

DeriveValues($N$)
  for all predicates $P$ and argument positions $i$ for $P$
    $\mathrm{avals}(P,i,N) := \emptyset$;
  change $:= \top$;
  while (change)
    change $:= \bot$;
    for all Horn clauses $\Lambda \;\|\; \Delta \to P(t_1,\ldots,t_n) \in N$
      for all argument positions $1 \le i \le n$ where $\mathrm{avals}(P,i,N) \ne \mathbb{R}$
        if [($t_i = c$) or $t_i$ is assigned a constant $c$ in $\Lambda$ and $c \notin \mathrm{avals}(P,i,N)$] then
          $\mathrm{avals}(P,i,N) := \mathrm{avals}(P,i,N) \cup \{c\}$, change $:= \top$;
        else if [$t_i$ appears in argument positions $(Q_1,k_1),\ldots,(Q_m,k_m)$ in $\Delta$
                 and $\mathrm{avals}(P,i,N) \not\supseteq \bigcap_j \mathrm{avals}(Q_j,k_j,N)$] then
          if [$\mathbb{R} \ne \bigcap_j \mathrm{avals}(Q_j,k_j,N)$] then
            $\mathrm{avals}(P,i,N) := \mathrm{avals}(P,i,N) \cup \bigcap_j \mathrm{avals}(Q_j,k_j,N)$, change $:= \top$;
          else
            $\mathrm{avals}(P,i,N) := \mathbb{R}$, change $:= \top$;


At the start, DeriveValues($N$) sets $\mathrm{avals}(P,i,N) = \emptyset$ for all predicate argument positions.
Then it repeatedly iterates over the clauses in $N$ and uses the current sets avals in order to derive
new values, until it reaches a fixpoint. Whenever DeriveValues($N$) computes that a clause
can derive infinitely many values for an argument position, it simply sets $\mathrm{avals}(P,i,N) = \mathbb{R}$
for both real and integer argument positions. This is the case when we have a clause
$\Lambda \;\|\; \Delta \to P(t_1,\ldots,t_n)$ and an argument position $i$ for $P$ such that: (i) $t_i$ is not a constant (and
therefore a variable), (ii) $t_i$ is not assigned a constant $c$ in $\Lambda$ (i.e., there is no equation $t_i = c$ in
$\Lambda$), and (iii) $t_i$ is only connected to argument positions $(Q_1,k_1),\ldots,(Q_m,k_m)$ in $\Delta$ that already have
$\mathrm{avals}(Q_j,k_j,N) = \mathbb{R}$. The latter also includes the case that $t_i$ is not connected to any argument
positions in $\Delta$. For instance, DeriveValues($N$) would recognize that clause $C_1$ in example 6
can be used to derive infinitely many values for the argument position $(\mathrm{Speed},1)$ because
the variable $x_p$ is not assigned an equation in $C_1$'s theory constraint $\Lambda := (0 \le x_p,\ x_p < 8000)$
and $x_p$ is not connected to any argument position on the left side of the implication. Hence,
DeriveValues($N$) would set $\mathrm{avals}(\mathrm{Speed},1,N) = \mathbb{R}$.

For each run through the while loop, at least one predicate argument position is set to $\mathbb{R}$ or the
set is extended by at least one constant. The set of constants in $N$ as well as the number of
predicate argument positions in $N$ are finite, hence DeriveValues($N$) terminates. It is correct because
in each step it over-approximates the result of a hierarchic unit-resulting resolution step, see
Section 2. The above algorithm is highly inefficient. In our own implementation, we apply it only if
all clauses are non-recursive, and we first order the clauses based on their dependencies. This
guarantees that every clause is visited at most once and is sufficient for both of our use cases.

Based on avals, we can now build a tp-function $\beta^a$ that maps all finite argument positions
$(P,i)$ that our over-approximation detected to the over-approximation of their derivable values,
i.e., $\beta^a(P,i) := \mathrm{avals}(P,i,N)$ if $|\mathrm{avals}(P,i,N)| \in \mathbb{N}$ and $\beta^a(P,i) := \bot$ otherwise. With $\beta^a$ we
derive the finitely grounded over-approximation $\mathrm{agnd}(Y)$ of a set of clauses $Y$, a clause $Y$
or an atom $Y$. This set is equivalent to $\mathrm{gnd}_{\beta^a}(Y)$, except that we assume that all LA atoms
are simplified until they contain at most one integer number and that LA atoms that can be
evaluated are reduced to true and false and the respective clause simplified. Based on $\mathrm{agnd}(N)$
we define a new extension of HBS(SLA) called approximately grounded HBS(SLA):




Definition 7 (Approximately Grounded HBS(SLA): HBS(SLA)A). A clause set $N$ belongs to
the fragment approximately grounded HBS(SLA), or HBS(SLA)A for short, if $\mathrm{agnd}(N)$ belongs
to the HBS(SLA) fragment. It is called HBS(SLA)PA if it is also pure.


Example 8. Executing DeriveValues($N$) on example 6 leads to the following results:
$\mathrm{avals}(\mathrm{SpeedTable},1,N) = \{0,2000,4000,6000\}$,

$\mathrm{avals}(\mathrm{SpeedTable},2,N) = \{2000,4000,6000,8000\}$,

$\mathrm{avals}(\mathrm{SpeedTable},3,N) = \{1350,1600,1850,2100\}$,

$\mathrm{avals}(\mathrm{IgnDeg},2,N) = \{1350,1600,1850,2100\}$,

and all other argument positions $(P,i)$ are infinite, so $\mathrm{avals}(P,i,N) = \mathbb{R}$ for them.
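The results of Example 8 can be reproduced with a direct transcription of DeriveValues. This is a minimal Python sketch under simplifying assumptions, not the implementation in SPASS-SPL: the clause encoding (a triple of assigned constants, body atoms, and head atom), the marker `REALS`, and the names `derive_values` and `ECU` are hypothetical, and the theory constraint is reduced to the equations of the form variable = constant, which is the only part the algorithm inspects.

```python
REALS = "R"  # marker for "infinitely many derivable values"

def derive_values(clauses):
    """clauses: list of (assigned, body, head). `assigned` maps variables to
    constants fixed by an equation in the theory constraint, `body` is a list
    of atoms and `head` one atom. An atom is (predicate, tuple_of_terms);
    a term is a variable name (str) or a number."""
    avals = {}
    for _, body, head in clauses:
        for pred, args in body + [head]:
            for i in range(1, len(args) + 1):
                avals.setdefault((pred, i), set())
    change = True
    while change:
        change = False
        for assigned, body, (pred, args) in clauses:
            for i, t in enumerate(args, start=1):
                cur = avals[(pred, i)]
                if cur == REALS:
                    continue
                if not isinstance(t, str) or t in assigned:
                    c = assigned[t] if isinstance(t, str) else t
                    if c not in cur:
                        cur.add(c)
                        change = True
                    continue
                # Body argument positions that share the variable t.
                sources = [avals[(q, k)] for q, qargs in body
                           for k, s in enumerate(qargs, start=1) if s == t]
                finite = [s for s in sources if s != REALS]
                if not finite:  # unconnected, or only infinite sources
                    avals[(pred, i)] = REALS
                    change = True
                else:
                    new = set.intersection(*finite)
                    if not new <= cur:
                        cur |= new
                        change = True
    return avals

# Example 6; the inequalities of C1, C2, C5, C6 assign no constants.
ECU = [
    ({}, [], ("SpeedTable", (0, 2000, 1350))),
    ({}, [], ("SpeedTable", (2000, 4000, 1600))),
    ({}, [], ("SpeedTable", (4000, 6000, 1850))),
    ({}, [], ("SpeedTable", (6000, 8000, 2100))),
    ({}, [], ("Speed", ("xp",))),                                   # C1
    ({}, [("Speed", ("xp",)), ("SpeedTable", ("x1", "x2", "y"))],
     ("IgnDeg", ("xp", "y"))),                                      # C2
    ({}, [("IgnDeg", ("xp", "z"))], ("ResArgs", ("xp",))),          # C3
    ({}, [("ResArgs", ("xp",))], ("Conj", ("xp",))),                # C4
    ({}, [], ("Conj", ("xp",))),                                    # C5
    ({}, [], ("Conj", ("xp",))),                                    # C6
]

avals = derive_values(ECU)
print(sorted(avals[("IgnDeg", 2)]))
```

The sketch iterates naively to a fixpoint, mirroring the pseudocode; it does not apply the clause ordering used to visit every clause at most once.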

We can now easily check whether $\mathrm{agnd}(N)$ would turn our clause set into an HBS(SLA)
clause set by checking whether the following holds for all inequalities: all variables in the
inequality except for one must be connected to a finite argument position on the left side of the
clause it appears in. This guarantees that all but one variable will be instantiated in $\mathrm{agnd}(N)$
and the inequality can therefore be simplified to a variable bound.


Connecting Argument Positions and Selecting Test Points As our second step, we are 
reducing the number of test points per predicate argument position by incorporating that 
not all argument positions are connected to all inequalities. This also means that we select 
different sets of test points for different argument positions. For finite argument positions, 
we can simply pick avals(P,i,N) as its set of test points. However, before we can compute the 
test-point sets for all other argument positions, we first have to determine to which inequalities 
and other argument positions they are connected. 

Let $N$ be an HBS(SLA)PA clause set and $(P,i)$ an argument position for a predicate
in $N$. Then we denote by $\mathrm{conArgs}(P,i,N)$ the set of connected argument positions and by
$\mathrm{conIneqs}(P,i,N)$ the set of connected inequalities. Formally, $\mathrm{conArgs}(P,i,N)$ is defined as
the minimal set that fulfills the following conditions: (i) two argument positions $(P,i)$ and
$(Q,j)$ are connected if they share a variable in a clause in $N$, i.e., $(Q,j) \in \mathrm{conArgs}(P,i,N)$ if
$(\Lambda \;\|\; \Delta \to H) \in N$, $P(\bar{t}), Q(\bar{s}) \in \mathrm{atoms}(\Delta \cup \{H\})$, and $t_i = s_j = x$; and (ii) the connection relation
is transitive, i.e., if $(Q,j) \in \mathrm{conArgs}(P,i,N)$, then $\mathrm{conArgs}(P,i,N) = \mathrm{conArgs}(Q,j,N)$.
Similarly, $\mathrm{conIneqs}(P,i,N)$ is defined as the minimal set that fulfills the following conditions: (i) an
argument position $(P,i)$ is connected to an instance $\lambda'$ of an inequality $\lambda$ if they share a variable
in a clause in $N$, i.e., $\lambda' \in \mathrm{conIneqs}(P,i,N)$ if $(\Lambda \;\|\; \Delta \to H) \in N$, $P(\bar{t}) \in \mathrm{atoms}(\Delta \cup \{H\})$, $t_i = x$,
$(\Lambda' \;\|\; \Delta' \to H') \in \mathrm{agnd}(\Lambda \;\|\; \Delta \to H)$, $\lambda' \in \Lambda'$, and $\lambda' = x \triangleleft c$ (where $\triangleleft \in \{<,>,\le,\ge,=,\ne\}$ and $c \in \mathbb{Z}$);
(ii) an argument position $(P,i)$ is connected to a value $c \in \mathbb{Z}$ if $P(\bar{t})$ with $t_i = c$ appears in a clause
in $N$, i.e., $(x = c) \in \mathrm{conIneqs}(P,i,N)$ if $(\Lambda \;\|\; \Delta \to H) \in N$, $P(\bar{t}) \in \mathrm{atoms}(\Delta \cup \{H\})$, and $t_i = c$;
(iii) an argument position $(P,i)$ is connected to a value $c \in \mathbb{Z}$ if $(P,i)$ is finite and $c \in \mathrm{avals}(P,i,N)$,
i.e., $(x = c) \in \mathrm{conIneqs}(P,i,N)$ if $(P,i)$ is finite and $c \in \mathrm{avals}(P,i,N)$; and (iv) the connection
relation is transitive, i.e., $\lambda \in \mathrm{conIneqs}(Q,j,N)$ if $\lambda \in \mathrm{conIneqs}(P,i,N)$ and $(Q,j) \in \mathrm{conArgs}(P,i,N)$.


Example 9. To highlight the connections in example 6 more clearly, we use the same variable
symbol for connected argument positions. Therefore $(\mathrm{SpeedTable},1)$ and $(\mathrm{SpeedTable},2)$ are
only connected to themselves, $\mathrm{conArgs}(\mathrm{SpeedTable},3,N) = \{(\mathrm{SpeedTable},3),(\mathrm{IgnDeg},2)\}$,
and $\mathrm{conArgs}(\mathrm{Speed},1,N) = \{(\mathrm{Speed},1),(\mathrm{IgnDeg},1),(\mathrm{ResArgs},1),(\mathrm{Conj},1)\}$. Computing the
connected inequalities is a little bit more complicated: first, if a connected argument
position is finite, then we have to add all values in avals as equations to the connected
inequalities. E.g., $\mathrm{conIneqs}(\mathrm{SpeedTable},1,N) = \{x_1 = 0, x_1 = 2000, x_1 = 4000, x_1 = 6000\}$
because $\mathrm{avals}(\mathrm{SpeedTable},1,N) = \{0,2000,4000,6000\}$. Second, we have to add all inequalities
connected in $\mathrm{agnd}(N)$. Again this is possible without explicitly computing $\mathrm{agnd}(N)$. E.g., for
the inequality $x_1 \le x_p$ in clause $C_2$, we determine that $x_1$ is connected to the finite argument
position $(\mathrm{SpeedTable},1)$ in $C_2$ and $x_p$ is not connected to any finite argument positions. Hence,
we have to connect the following variable bounds to all argument positions connected to $x_p$,
i.e., $\{x_1 \le x_p \mid x_1 \in \mathrm{avals}(\mathrm{SpeedTable},1,N)\} = \{x_p \ge 0, x_p \ge 2000, x_p \ge 4000, x_p \ge 6000\}$ to
the argument positions $\mathrm{conArgs}(\mathrm{Speed},1,N)$. If we apply the above two steps to all clauses,
then we get as connected inequalities:
$\mathrm{conIneqs}(\mathrm{SpeedTable},2,N) = \{x_2 = 2000, x_2 = 4000, x_2 = 6000, x_2 = 8000\}$,
$\mathrm{conIneqs}(\mathrm{SpeedTable},3,N) = \{y = 1350, y = 1600, y = 1850, y = 2100\}$, and
$\mathrm{conIneqs}(\mathrm{Speed},1,N) = \{x_p < 0, x_p < 2000, x_p < 4000, x_p < 6000, x_p < 8000, x_p \ge 0, x_p \ge 2000, x_p \ge 4000, x_p \ge 6000, x_p \ge 8000\}$.
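Since conArgs is the transitive closure of "shares a variable in some clause", it can be computed with a union-find structure. The following Python sketch illustrates this for example 6; the function `con_args`, its clause encoding, and the list `ECU_ATOMS` are hypothetical and only cover the atoms of each clause, because theory constraints do not contribute to conArgs.

```python
def con_args(clauses):
    """clauses: list of atom lists (body atoms plus head). An atom is
    (predicate, tuple_of_terms); variables are strings, constants numbers.
    Returns a function mapping an argument position to its connected set."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for atoms in clauses:
        first_use = {}  # variable -> first argument position using it
        for pred, args in atoms:
            for i, t in enumerate(args, start=1):
                pos = (pred, i)
                find(pos)  # register the position
                if isinstance(t, str):
                    anchor = first_use.setdefault(t, pos)
                    parent[find(pos)] = find(anchor)

    def connected(p):
        root = find(p)
        return {q for q in list(parent) if find(q) == root}

    return connected

ROWS = [(0, 2000, 1350), (2000, 4000, 1600),
        (4000, 6000, 1850), (6000, 8000, 2100)]
ECU_ATOMS = [[("SpeedTable", row)] for row in ROWS] + [
    [("Speed", ("xp",))],                                         # C1
    [("Speed", ("xp",)), ("SpeedTable", ("x1", "x2", "y")),
     ("IgnDeg", ("xp", "y"))],                                    # C2
    [("IgnDeg", ("xp", "z")), ("ResArgs", ("xp",))],              # C3
    [("ResArgs", ("xp",)), ("Conj", ("xp",))],                    # C4
    [("Conj", ("xp",))],                                          # C5
    [("Conj", ("xp",))],                                          # C6
]

connected = con_args(ECU_ATOMS)
print(sorted(connected(("Speed", 1))))
```

Variable names are local to a clause in this encoding, so `xp` in two different clauses connects positions only through the predicates the clauses share.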


Now based on these sets we can construct a set of test points as follows: For each
argument position $(P,i)$, we partition the reals $\mathbb{R}$ into intervals such that any variable bound
$\lambda \in \mathrm{conIneqs}(P,i,N)$ is satisfied by all points in one such interval $I$ or none. Since we are in the
Horn case, this is enough to ensure that we derive facts uniformly over those intervals and the
integers/non-integers. To be more precise, we derive facts uniformly over those intervals and
the integers because $P(\bar{a})$ is derivable from $N$ and $a_j \in I \cap \mathbb{Z}$ implies that $P(\bar{b})$ is also derivable
from $N$, where $b_i = a_i$ for $i \ne j$ and $b_j \in I \cap \mathbb{Z}$. Similarly, we derive facts uniformly over those
intervals and the non-integers because $P(\bar{a})$ is derivable from $N$ and $a_j \in I \setminus \mathbb{Z}$ implies that $P(\bar{b})$
is also derivable from $N$, where $b_i = a_i$ for $i \ne j$ and $b_j \in I \setminus \mathbb{Z}$. As a result, it is enough to pick (if
possible) one integer and one non-integer test point per interval to cover the whole clause set.

Formally we compute the interval partition $\mathrm{iPart}(P,i,N)$ and the set of test points $\mathrm{tps}(P,i,N)$
as follows: First we transform all variable bounds $\lambda \in \mathrm{conIneqs}(P,i,N)$ into interval borders. A
variable bound $x \triangleleft c$ with $\triangleleft \in \{\le,<,>,\ge\}$ in $\mathrm{conIneqs}(P,i,N)$ is turned into two interval borders.
One of them is the interval border implied by the bound itself and the other its negation, e.g.,
$x \ge 5$ results in the interval border $[5$ and the interval border of the negation $5)$. Likewise,
we turn every variable bound $x \triangleleft c$ with $\triangleleft \in \{=,\ne\}$ into all four possible interval borders for
$c$, i.e., $c)$, $[c$, $c]$, and $(c$. The set of interval borders $\mathrm{iEP}(P,i,N)$ is then defined as follows:


$\mathrm{iEP}(P,i,N) = \{\, c],\ (c \mid x \triangleleft c \in \mathrm{conIneqs}(P,i,N) \text{ where } \triangleleft \in \{\le,=,\ne,>\} \,\}$
$\qquad \cup\ \{\, c),\ [c \mid x \triangleleft c \in \mathrm{conIneqs}(P,i,N) \text{ where } \triangleleft \in \{\ge,=,\ne,<\} \,\}\ \cup\ \{(-\infty,\ \infty)\}$


The interval partition $\mathrm{iPart}(P,i,N)$ can be constructed by sorting $\mathrm{iEP}(P,i,N)$ in
ascending order such that we first order by the border value, i.e., $\delta < \epsilon$ if $\delta \in \{c),[c,c],(c\}$,
$\epsilon \in \{d),[d,d],(d\}$, and $c < d$, and then by the border type, i.e., $c) < [c < c] < (c$. The
result is a sequence $[\ldots,\delta_l,\delta_u,\ldots]$, where we always have one lower border $\delta_l$, followed by
one upper border $\delta_u$. We can guarantee that an upper border $\delta_u$ follows a lower border $\delta_l$
because $\mathrm{iEP}(P,i,N)$ always contains $c)$ together with $[c$ and $c]$ together with $(c$ for $c \in \mathbb{Z}$, so
always two consecutive upper and lower borders. Together with $(-\infty$ and $\infty)$ this guarantees
that the sorted $\mathrm{iEP}(P,i,N)$ has the desired structure. If we combine every two subsequent
borders $\delta_l$, $\delta_u$ in our sorted sequence $[\ldots,\delta_l,\delta_u,\ldots]$, then we receive our partition of intervals
$\mathrm{iPart}(P,i,N)$. For instance, if $x < 5$ and $x = 0$ are the only variable bounds in $\mathrm{conIneqs}(P,i,N)$,
then $\mathrm{iEP}(P,i,N) = \{5),\ [5,\ 0),\ [0,\ 0],\ (0,\ (-\infty,\ \infty)\}$ and if we sort and combine them we get
$\mathrm{iPart}(P,i,N) = \{(-\infty,0),[0,0],(0,5),[5,\infty)\}$.
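The border-sorting construction is short enough to sketch in code. In the following Python sketch, the function names `i_part` and `show`, the encoding of a border as a pair of value and bracket character, and the string spelling of the comparison operators are assumptions made for illustration.

```python
import math

# Border types in the sorting order c) < [c < c] < (c.
RANK = {")": 0, "[": 1, "]": 2, "(": 3}

def i_part(bounds):
    """bounds: iterable of (op, c) pairs standing for the variable bound
    x op c, with op in {"<", "<=", ">", ">=", "=", "!="}.
    Returns the interval partition as a list of (lower, upper) borders."""
    borders = {(-math.inf, "("), (math.inf, ")")}
    for op, c in bounds:
        if op in ("<=", ">", "=", "!="):
            borders |= {(c, "]"), (c, "(")}   # borders c] and (c
        if op in (">=", "<", "=", "!="):
            borders |= {(c, ")"), (c, "[")}   # borders c) and [c
    ordered = sorted(borders, key=lambda b: (b[0], RANK[b[1]]))
    # The sorted sequence alternates lower border, upper border.
    return [(ordered[k], ordered[k + 1]) for k in range(0, len(ordered), 2)]

def show(interval):
    (lo, lo_kind), (hi, hi_kind) = interval
    return f"{lo_kind}{lo}, {hi}{hi_kind}"

partition = [show(iv) for iv in i_part([("<", 5), ("=", 0)])]
print(partition)
```

For the bounds $x < 5$ and $x = 0$ this yields the four intervals of the example above, with infinity printed as `inf`.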
After constructing $\mathrm{iPart}(P,i,N)$, we can finally construct the set of test points $\mathrm{tps}(P,i,N)$
for argument position $(P,i)$. If $|\mathrm{avals}(P,i,N)| \in \mathbb{N}$, i.e., we determined that $(P,i)$ is finite,
then $\mathrm{tps}(P,i,N) = \mathrm{avals}(P,i,N)$. If the argument position $(P,i)$ is over a first-order sort $\mathcal{F}_i$,
i.e., $\mathrm{sort}(P,i) = \mathcal{F}_i$, then we should always be able to determine that $(P,i)$ is finite because
$\mathcal{F}_i$ is finite. If the argument position $(P,i)$ is over an arithmetic sort, i.e., $\mathrm{sort}(P,i) = \mathcal{R}$ or
$\mathrm{sort}(P,i) = \mathcal{Z}$, and our approximation could not determine that $(P,i)$ is finite, then the test-point
set $\mathrm{tps}(P,i,N)$ for $(P,i)$ consists of at most two points per interval $I \in \mathrm{iPart}(P,i,N)$: one integer
value $a_I \in I \cap \mathbb{Z}$ if $I$ contains integers (i.e., if $I \cap \mathbb{Z} \ne \emptyset$) and one non-integer value $b_I \in I \setminus \mathbb{Z}$
if $I$ contains non-integers (i.e., if $I$ is not just one integer point). Additionally, we enforce that
$\mathrm{tps}(P,i,N) = \mathrm{tps}(Q,j,N)$ if $\mathrm{conArgs}(P,i,N) = \mathrm{conArgs}(Q,j,N)$ and both $(P,i)$ and $(Q,j)$ are
infinite argument positions. (In our implementation of this test-point scheme, we optimize
the test point selection even further by picking only one test point per interval, if possible
an integer value and otherwise a non-integer, if all $\mathrm{conArgs}(P,i,N)$ and all variables $x$
connecting them in $N$ have the same sort. However, we do not prove this optimization explicitly
here because the proofs are almost identical to the case for two test points per interval.)

Based on these sets, we can now also define a tp-function $\beta$ and an ep-function $\eta$. For the
tp-function, we simply assign any argument position to $\mathrm{tps}(P,i,N)$, i.e., $\beta(P,i) = \mathrm{tps}(P,i,N) \cap \mathrm{sort}(P,i)^{\mathcal{A}}$.
(The intersection with $\mathrm{sort}(P,i)^{\mathcal{A}}$ is needed to guarantee that the test-point set of
an integer argument position is well-typed.) This also means that $\beta$ is total and finite. For the
ep-function $\eta$, we extrapolate any test-point vector $\bar{a}$ (with $\bar{a} = \bar{x}\sigma$ and $\sigma \in \mathrm{wtis}_\beta(P(\bar{x}))$) over the
(non-)integer subset of the intervals the test points belong to, i.e., $\eta(P,\bar{a}) = I'_1 \times \ldots \times I'_n$, where $I'_i = \{a_i\}$
if we determined that $(P,i)$ is finite and otherwise $I_i$ is the interval $I_i \in \mathrm{iPart}(P,i,N)$ with
$a_i \in I_i$, and $I'_i = I_i \cap \mathbb{Z}$ if $a_i$ is an integer value and $I'_i = I_i \setminus \mathbb{Z}$ if $a_i$ is a non-integer value. Note that
this means that $\eta$ might not be complete for every predicate $P$, e.g., when $P$ has a finite argument
position $(P,i)$ with an infinite domain. However, both $\beta$ and $\eta$ together still cover the clause set $N$,
cover any universal conjecture $N \models \forall \bar{x}.Q(\bar{x})$, and cover any existential conjecture $N \models \exists \bar{x}.Q(\bar{x})$.


Theorem 10. The tp-function $\beta$ covers $N$. The tp-function $\beta$ covers an existential conjecture
$N \models \exists \bar{x}.Q(\bar{x})$. The tp-function $\beta$ covers a universal conjecture $N \models \forall \bar{x}.Q(\bar{x})$.


Example 11. Continuation of example 6: The majority of argument positions in our example
are finite. Hence, determining their test point set is equivalent to the over-approximation
of derivable values avals we computed for them: $\beta(\mathrm{SpeedTable},1) = \{0,2000,4000,6000\}$,
$\beta(\mathrm{SpeedTable},2) = \{2000,4000,6000,8000\}$, $\beta(\mathrm{SpeedTable},3) = \{1350,1600,1850,2100\}$,
and $\beta(\mathrm{IgnDeg},2) = \{1350,1600,1850,2100\}$. The other argument positions are all connected
to $(\mathrm{Speed},1)$ and $\mathrm{conIneqs}(\mathrm{Speed},1,N) = \{x_p < 0, x_p < 2000, x_p < 4000, x_p < 6000, x_p < 8000,
x_p \ge 0, x_p \ge 2000, x_p \ge 4000, x_p \ge 6000, x_p \ge 8000\}$, from which we can compute
$\mathrm{iPart}(P,i,N) = \{(-\infty,0),[0,2000),[2000,4000),[4000,6000),[6000,8000),[8000,\infty)\}$

and select the test point sets $\beta(\mathrm{Speed},1) = \beta(\mathrm{IgnDeg},1) = \beta(\mathrm{ResArgs},1) = \beta(\mathrm{Conj},1) =
\{-1,0,2000,4000,6000,8000\}$. (Note that all variables in our problem are over the reals, so
we only have to select one test point per interval! Moreover, in our previous version of the test
point scheme, there would have been more intervals in the partition because we would have
processed all inequalities, e.g., also those in $\mathrm{conIneqs}(\mathrm{SpeedTable},3,N)$.) The ep-function
$\eta$ that determines which interval is represented by which test point is $\eta(P,1,-1) = (-\infty,0)$,
$\eta(P,1,0) = [0,2000)$, $\eta(P,1,2000) = [2000,4000)$, $\eta(P,1,4000) = [4000,6000)$, $\eta(P,1,6000) =
[6000,8000)$, $\eta(P,1,8000) = [8000,\infty)$ for the predicates Speed, IgnDeg, ResArgs, and Conj.
$\eta$ behaves like the identity function for all other argument positions because they are finite.
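The selection of test points per interval can be sketched as follows. This Python sketch assumes that all finite interval borders are integers (as for variable bounds $x \triangleleft c$ with $c \in \mathbb{Z}$); the function `pick_points`, the tuple encoding of intervals, and the particular choice of points (lowest admissible integer, border plus or minus 0.5) are hypothetical, since any point of the right kind inside the interval would do.

```python
import math

def pick_points(lo, lo_closed, hi, hi_closed):
    """Pick (integer, non-integer) test points for an interval whose finite
    borders are integers; an entry is None if no such point exists."""
    if lo != -math.inf:
        cand = lo if lo_closed else lo + 1
        integer = cand if cand < hi or (cand == hi and hi_closed) else None
    elif hi != math.inf:
        integer = hi if hi_closed else hi - 1
    else:
        integer = 0
    if lo != -math.inf:
        non_integer = lo + 0.5 if lo < hi else None
    elif hi != math.inf:
        non_integer = hi - 0.5
    else:
        non_integer = 0.5
    return integer, non_integer

# Interval partition for (Speed,1) from Example 11.
cuts = [0, 2000, 4000, 6000, 8000]
partition = [(-math.inf, False, 0, False)]
partition += [(a, True, b, False) for a, b in zip(cuts, cuts[1:])]
partition += [(8000, True, math.inf, False)]

both = [pick_points(*iv) for iv in partition]
# Optimization for positions of a single sort: one point per interval,
# preferring the integer point.
one_per_interval = [i if i is not None else r for i, r in both]
print(one_per_interval)
```

With this choice of representatives, the optimized selection coincides with the test-point set given in Example 11.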
From a Test-Point Function to a Datalog Hammer. We can use the covering definitions, e.g.,
$\mathrm{gnd}_\beta(N)$ is equisatisfiable to $N$, to instantiate our clause set (and conjectures) with numbers. As
a result, we can simply evaluate all theory atoms and thus reduce our HBS(SLA)PA clause
sets/conjectures to ground HBS clause sets, which means we could reduce our input into formulas
without any arithmetic theory that can be solved by any Datalog reasoner. There is, however, one
problem. The set $\mathrm{gnd}_\beta(N)$ grows exponentially with regard to the maximum number of
variables $n_C$ in any clause in $N$, i.e., $O(|\mathrm{gnd}_\beta(N)|) = O(|N| \cdot |B|^{n_C})$, where $B = \max_{(P,i)}(\beta(P,i))$
is the largest test-point set for any argument position. Since $n_C$ is large for realistic examples,
e.g., in our examples the size of $n_C$ ranges from 9 to 11 variables, the finite abstraction is often
too large to be solvable in reasonable time. Due to this blow-up, we have chosen an alternative
approach for our Datalog hammer. This hammer exploits the ideas behind the covering
definitions and will allow us to make the same ground deductions, but instead of grounding
everything, we only need to (i) ground the negated conjecture over our tp-function and (ii) provide
a set of ground facts that define which theory atoms are satisfied by our test points. As a result,
the hammered formula is much more concise and we need no actual theory reasoning to solve
the formula. In fact, we can solve the hammered formula by greedily applying unit resolution
until this produces the empty clause (which would mean the conjecture is implied) or until
it produces no more new facts (which would mean we have found a counterexample). In
practice, greedily applying resolution is not the best strategy and we recommend using more
advanced HBS techniques, for instance those used by a state-of-the-art Datalog reasoner.
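To get a feeling for the blow-up, the bound $|N| \cdot |B|^{n_C}$ can be evaluated for one benchmark. The Python sketch below plugs in the values listed for ecu_e1 in Fig. 2 (757 clauses, at most 10 variables per clause, largest test-point set of size 96); the helper name `grounding_bound` is hypothetical, and the result is only the worst-case bound, not the measured size of any grounding.

```python
def grounding_bound(num_clauses, max_test_points, max_vars):
    """Worst-case number of ground instances: |N| * |B|^(n_C)."""
    return num_clauses * max_test_points ** max_vars

# Values for ecu_e1 as listed in Fig. 2.
bound = grounding_bound(757, 96, 10)
print(f"{bound:.2e}")
```

Even for this moderately sized clause set the bound exceeds $10^{22}$ instances, which is why the hammer avoids grounding the clauses themselves.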


The Datalog hammer takes as input (i) an HBS(SLA)PA clause set $N$ and (ii) optionally a
universal conjecture $\forall \bar{y}.P(\bar{y})$. The case for existential conjectures is handled by encoding the
conjecture $N \models \exists \bar{x}.Q(\bar{x})$ as the clause set $N \cup \{Q(\bar{x}) \to \bot\}$, which is unsatisfiable if and only if
the conjecture holds. Given this input, the Datalog hammer first computes the tp-function $\beta$ and
the ep-function $\eta$ as described above. Next, it computes four clause sets that will make up the
Datalog formula.

The first set $\mathrm{tren}_N(N)$ is computed by abstracting away any arithmetic from
the clauses $(\Lambda \;\|\; \Delta \to H) \in N$. This is done by replacing each theory atom $A$ in $\Lambda$ with a literal
$P_A(\bar{x})$, where $\mathrm{vars}(A) = \mathrm{vars}(\bar{x})$ and $P_A$ is a fresh predicate. The abstraction of the theory atoms
is necessary because Datalog does not support non-constant function symbols (e.g., $+$, $-$) that
would otherwise appear in approximately grounded theory atoms. Moreover, it is necessary to
add extra sort literals $\neg Q_{(P,i,S)}(x)$ for some of the variables $x \in \mathrm{vars}(H)$, where $H = P(\bar{t})$, $t_i = x$,
$\mathrm{sort}(x) = S$, and $Q_{(P,i,S)}$ is a fresh predicate. This is necessary in order to define the test point
set for $x$ if $x$ does not appear in $\Lambda$ or in $\Delta$. It is also necessary in order to filter out any test points
that are not integer values if $x$ is an integer variable (i.e., $\mathrm{sort}(x) = \mathcal{Z}$) but connected only to real
sorted argument positions in $\Delta$ (i.e., $\mathrm{sort}(Q,j) = \mathcal{R}$ for all $(Q,j) \in \mathrm{depend}(x,\Delta)$). It is possible to
reduce the number of fresh predicates needed, e.g., by reusing the same predicate for two theory
atoms whose variables range over the same sets of test points. The resulting abstracted clause then
has the form $\Delta_T, \Delta_S, \Delta \to H$, where $\Delta_T$ contains the abstracted theory literals (e.g., $P_A(\bar{x}) \in \Delta_T$)
and $\Delta_S$ the "sort" literals (e.g., $Q_{(P,i,S)}(x) \in \Delta_S$).

The second set is denoted by $N_C$ and it is
empty if we have no universal conjecture or if $\eta$ does not cover our conjecture. Otherwise,
$N_C$ contains the ground and negated version $\phi$ of our universal conjecture $\forall \bar{y}.P(\bar{y})$. $\phi$ has
the form $\Delta_\phi \to \bot$, where $\Delta_\phi = \mathrm{gnd}_\beta(P(\bar{y}))$ contains all literals $P(\bar{y})$ for all groundings over $\beta$.
We cannot skip this grounding but the worst-case size of $\Delta_\phi$ is $O(|\mathrm{gnd}_\beta(P(\bar{y}))|) = O(|B|^{n_\phi})$,
where $n_\phi = |\bar{y}|$, which is in our applications typically much smaller than the maximum number
of variables $n_C$ contained in some clause in $N$.

The third set is denoted by $\mathrm{tfacts}(N,\beta)$ and
contains a fact $\mathrm{tren}_N(A)$ for every ground theory atom $A$ contained in the theory part $\Lambda$ of a
clause $(\Lambda \;\|\; \Delta \to H) \in \mathrm{gnd}_\beta(N)$ such that $A$ simplifies to true. This is enough to ensure that our
abstracted theory predicates evaluate every test point in every satisfiable interpretation $\mathcal{A}$ to true
that also would have evaluated to true in the actual theory atom. Alternatively, it is also possible
to use a set of axioms and a smaller set of facts and let the Datalog reasoner compute all relevant
theory facts for itself. The set $\mathrm{tfacts}(N,\beta)$ can be computed without computing $\mathrm{gnd}_\beta(N)$ if we
simply iterate over all theory atoms $A$ in all constraints $\Lambda$ of all clauses $Y = \Lambda \;\|\; \Delta \to H$ (with
$Y \in N$) and compute all well typed groundings $\tau \in \mathrm{wtis}_\beta(Y)$ such that $A\tau$ simplifies to true. This
can be done in time $O(\mu(n_v) \cdot n_L \cdot |B|^{n_v})$ and the resulting set $\mathrm{tfacts}(N,\beta)$ has worst-case size
$O(n_A \cdot |B|^{n_v})$, where $n_L$ is the number of literals in $N$, $n_v$ is the maximum number of variables
$|\mathrm{vars}(A)|$ in any theory atom $A$ in $N$, $n_A$ is the number of different theory atoms in $N$, and $\mu(x)$
is the time needed to simplify a theory atom over $x$ variables to a variable bound.

The last set is
denoted by $\mathrm{sfacts}(N,\beta)$ and contains a fact $Q_{(P,i,S)}(a)$ for every fresh sort predicate $Q_{(P,i,S)}$
added during abstraction and every $a \in \beta(P,i) \cap S^{\mathcal{A}}$. This is enough to ensure that $Q_{(P,i,S)}$
evaluates to true for every test point assigned to the argument position $(P,i)$ filtered by the
sort $S$. Please note that already satisfiability testing for BS clause sets is NEXPTIME-complete
in general, and DEXPTIME-complete for the Horn case [26,33]. So when abstracting to a
polynomially decidable clause set (ground HBS) an exponential factor is unavoidable.
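The computation of the theory facts by iterating over theory atoms and test points can be sketched for example 6. In this Python sketch, the dictionary `BETA` is keyed by variable names rather than argument positions, the predicate names such as `"P_x1<=xp"` stand in for the fresh predicates $P_A$, and each theory atom is given as an executable test; all of these are simplifying assumptions for illustration.

```python
from itertools import product

# Test points per variable, following the tp-function of Example 11.
BETA = {
    "x1": [0, 2000, 4000, 6000],             # beta(SpeedTable,1)
    "x2": [2000, 4000, 6000, 8000],          # beta(SpeedTable,2)
    "xp": [-1, 0, 2000, 4000, 6000, 8000],   # beta(Speed,1)
}

# Theory atoms of C1, C2, C5, C6: (fresh predicate, variables, test).
THEORY_ATOMS = [
    ("P_0<=xp",    ("xp",),      lambda xp: 0 <= xp),
    ("P_xp<8000",  ("xp",),      lambda xp: xp < 8000),
    ("P_x1<=xp",   ("x1", "xp"), lambda x1, xp: x1 <= xp),
    ("P_xp<x2",    ("xp", "x2"), lambda xp, x2: xp < x2),
    ("P_xp>=8000", ("xp",),      lambda xp: xp >= 8000),
    ("P_xp<0",     ("xp",),      lambda xp: xp < 0),
]

def tfacts(theory_atoms, beta):
    """One fact per grounding over the test points that evaluates to true."""
    facts = set()
    for name, variables, holds in theory_atoms:
        for values in product(*(beta[v] for v in variables)):
            if holds(*values):
                facts.add((name, values))
    return facts

facts = tfacts(THEORY_ATOMS, BETA)
print(len(facts))
```

Each theory atom is grounded only over its own variables, so the work is bounded by the number of theory atoms times $|B|^{n_v}$ rather than by the size of the fully grounded clause set.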


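Lemma 12. $N$ is equisatisfiable to its hammered version $\mathrm{tren}_N(N) \cup \mathrm{tfacts}(N,\beta) \cup \mathrm{sfacts}(N,\beta)$.
The conjecture $N \models \exists \bar{y}.Q(\bar{y})$ is false iff $N_D = \mathrm{tren}_{N'}(N') \cup \mathrm{tfacts}(N',\beta) \cup \mathrm{sfacts}(N',\beta)$
is satisfiable with $N' = N \cup \{Q(\bar{y}) \to \bot\}$. The conjecture $N \models \forall \bar{y}.Q(\bar{y})$ is false iff
$N_D = \mathrm{tren}_N(N) \cup \mathrm{tfacts}(N,\beta) \cup \mathrm{sfacts}(N,\beta) \cup N_C$ is satisfiable.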


Note that $\mathrm{tren}_N(N) \cup \mathrm{tfacts}(N,\beta) \cup \mathrm{sfacts}(N,\beta) \cup N_C$ is only an HBS clause set over a
finite set of constants and not yet a Datalog input file. It is well known that such a formula
can be transformed easily into a Datalog problem by adding a nullary predicate Goal and
adding it as a positive literal to any clause without a positive literal. Querying for the Goal
atom returns true if the HBS clause set was unsatisfiable and false otherwise.


Example 13. The hammered formula for example 6 looks as follows. The set of renamed
clauses $\mathrm{tren}_N(N)$ consists of all the previous clauses in $N$, except that inequalities have been
abstracted to new first-order predicates:

$D_1': \mathrm{SpeedTable}(0,2000,1350)$, $D_2': \mathrm{SpeedTable}(2000,4000,1600)$,

$D_3': \mathrm{SpeedTable}(4000,6000,1850)$, $D_4': \mathrm{SpeedTable}(6000,8000,2100)$,

$C_1': P_{0 \le x_p}(x_p), P_{x_p < 8000}(x_p) \to \mathrm{Speed}(x_p)$,

$C_2': P_{x_1 \le x_p}(x_1,x_p), P_{x_p < x_2}(x_p,x_2), \mathrm{Speed}(x_p), \mathrm{SpeedTable}(x_1,x_2,y) \to \mathrm{IgnDeg}(x_p,y)$,

$C_3': \mathrm{IgnDeg}(x_p,z) \to \mathrm{ResArgs}(x_p)$, $C_4': \mathrm{ResArgs}(x_p) \to \mathrm{Conj}(x_p)$,

$C_5': P_{x_p \ge 8000}(x_p) \to \mathrm{Conj}(x_p)$, $C_6': P_{x_p < 0}(x_p) \to \mathrm{Conj}(x_p)$,

The set $\mathrm{tfacts}(N,\beta)$ defines for which test points those new predicates evaluate to true:
{$P_{0 \le x_p}(0)$, $P_{0 \le x_p}(2000)$, $P_{0 \le x_p}(4000)$, $P_{0 \le x_p}(6000)$, $P_{0 \le x_p}(8000)$, $P_{x_p < 8000}(-1)$,
$P_{x_p < 8000}(0)$, $P_{x_p < 8000}(2000)$, $P_{x_p < 8000}(4000)$, $P_{x_p < 8000}(6000)$, $P_{x_1 \le x_p}(0,0)$,

$P_{x_1 \le x_p}(0,2000)$, $P_{x_1 \le x_p}(0,4000)$, $P_{x_1 \le x_p}(0,6000)$, $P_{x_1 \le x_p}(0,8000)$, $P_{x_1 \le x_p}(2000,2000)$,
$P_{x_1 \le x_p}(2000,4000)$, $P_{x_1 \le x_p}(2000,6000)$, $P_{x_1 \le x_p}(2000,8000)$, $P_{x_1 \le x_p}(4000,4000)$,

$P_{x_1 \le x_p}(4000,6000)$, $P_{x_1 \le x_p}(4000,8000)$, $P_{x_1 \le x_p}(6000,6000)$, $P_{x_1 \le x_p}(6000,8000)$,
$P_{x_p < x_2}(-1,2000)$, $P_{x_p < x_2}(0,2000)$, $P_{x_p < x_2}(-1,4000)$, $P_{x_p < x_2}(0,4000)$,

$P_{x_p < x_2}(2000,4000)$, $P_{x_p < x_2}(-1,6000)$, $P_{x_p < x_2}(0,6000)$, $P_{x_p < x_2}(2000,6000)$,
Problem  $Q$        Status  $|N|$  vars  $|B^s|$  $|\Delta_\phi|$  SSPL    $|B^o|$  $|\Delta_\phi|$  SSPL06  vampire  spacer  z3      cvc4
lc_e1    $\exists$  true    139    9     9        0                <0.1s   45       0                <0.1s   <0.1s    <0.1s   0.1s    <0.1s
lc_e2    $\exists$  false   144    9     9        0                <0.1s   41       0                <0.1s   <0.1s    <0.1s   -       -
lc_e3    $\exists$  false   138    9     9        0                <0.1s   37       0                <0.1s   <0.1s    <0.1s   -       -
lc_e4    $\exists$  true    137    9     9        0                <0.1s   49       0                <0.1s   <0.1s    <0.1s   <0.1s   <0.1s
lc_e5    $\exists$  false   152    13    9        0                33.5s   -        -                N/A     <0.1s    -       -       -
lc_e6    $\exists$  true    141    13    9        0                42.8s   -        -                N/A     0.1s     3.3s    11.5s   0.4s
lc_e7    $\exists$  false   141    13    9        0                41.4s   -        -                N/A     <0.1s    7.6s    -       -
lc_e8    $\exists$  false   141    13    9        0                32.5s   -        -                N/A     <0.1s    2.1s    -       -
lc_u1    $\forall$  false   139    9     9        27               <0.1s   45       27               <0.1s   <0.1s    N/A     -       -
lc_u2    $\forall$  false   144    9     9        27               <0.1s   41       27               <0.1s   <0.1s    N/A     -       -
lc_u3    $\forall$  true    138    9     9        27               <0.1s   37       27               <0.1s   <0.1s    N/A     <0.1s   <0.1s
lc_u4    $\forall$  false   137    9     9        27               <0.1s   49       27               <0.1s   <0.1s    N/A     -       -
lc_u5    $\forall$  false   154    13    9        3888             32.4s   -        -                N/A     0.1s     N/A     -       -
lc_u6    $\forall$  true    154    13    9        3888             32.5s   -        -                N/A     2.3s     N/A     -       -
lc_u7    $\forall$  true    141    13    9        972              32.3s   -        -                N/A     0.2s     N/A     -       -
lc_u8    $\forall$  false   141    13    9        1259712          48.8s   -        -                N/A     2351.4s  N/A     -       -
ecu_e1   $\exists$  false   757    10    96       0                <0.1s   624      0                1.3s    0.2s     0.1s    -       -
ecu_e2   $\exists$  true    757    10    96       0                <0.1s   624      0                1.3s    0.2s     0.1s    1.4s    0.4s
ecu_e3   $\exists$  false   775    11    196      0                50.1s   660      0                41.5s   3.1s     0.1s    -       -
ecu_u1   $\forall$  true    756    11    96       37               0.1s    620      306              1.1s    32.8s    N/A     197.5s  0.4s
ecu_u2   $\forall$  false   756    11    96       38               0.1s    620      307              1.1s    32.8s    N/A     -       -
ecu_u3   $\forall$  true    745    9     88       760              <0.1s   576      11360            0.7s    1.2s     N/A     239.5s  0.1s
ecu_u4   $\forall$  true    745    9     486      760              <0.1s   2144     237096           15.9s   1.2s     N/A     196.0s  0.1s
ecu_u5   $\forall$  true    767    10    96       3900             0.1s    628      415296           31.9s   -        N/A     -       -
ecu_u6   $\forall$  false   755    10    95       3120             <0.1s   616      363584           14.4s   597.8s   N/A     -       -
ecu_u7   $\forall$  false   774    11    196      8400             48.9s   656      2004708          -       -        N/A     -       -
ecu_u8   $\forall$  true    774    11    196      8400             48.7s   656      2004708          -       -        N/A     -       -


Fig. 2. Benchmark results and statistics


$P_{x_p < x_2}(4000,6000)$, $P_{x_p < x_2}(-1,8000)$, $P_{x_p < x_2}(0,8000)$, $P_{x_p < x_2}(2000,8000)$,

$P_{x_p < x_2}(4000,8000)$, $P_{x_p < x_2}(6000,8000)$, $P_{x_p \ge 8000}(8000)$, $P_{x_p < 0}(-1)$}

$\mathrm{sfacts}(N,\beta) = \emptyset$ because there are no fresh sort predicates. The hammered negated conjecture
is $N_C := \mathrm{Conj}(-1), \mathrm{Conj}(0), \mathrm{Conj}(2000), \mathrm{Conj}(4000), \mathrm{Conj}(6000), \mathrm{Conj}(8000) \to \bot$ and
lets us derive false if and only if we can derive $\mathrm{Conj}(a)$ for all test points $a \in \beta(\mathrm{Conj},1)$.
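The hammered formula of this example is small enough to be solved by naive forward chaining, which mirrors the greedy application of unit resolution mentioned earlier. The following Python sketch is only an illustration and not how VLog evaluates the program: the functions `match` and `saturate`, the predicate names standing in for the abstracted theory atoms, and the use of a `Goal` head for the negated conjecture are assumptions.

```python
def match(atom, fact, subst):
    """Extend subst so that atom equals fact, or return None."""
    pred, args = atom
    if pred != fact[0] or len(args) != len(fact[1]):
        return None
    s = dict(subst)
    for t, v in zip(args, fact[1]):
        if isinstance(t, str):            # variable
            if s.setdefault(t, v) != v:
                return None
        elif t != v:                      # constant mismatch
            return None
    return s

def saturate(rules, facts):
    """Apply all rules until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, (hpred, hargs) in rules:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for f in facts
                          if (s2 := match(atom, f, s)) is not None]
            for s in substs:
                new = (hpred, tuple(s[t] if isinstance(t, str) else t
                                    for t in hargs))
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

TPS = [-1, 0, 2000, 4000, 6000, 8000]
TABLE = [(0, 2000, 1350), (2000, 4000, 1600),
         (4000, 6000, 1850), (6000, 8000, 2100)]

facts = {("SpeedTable", row) for row in TABLE}
facts |= {("P_0<=xp", (a,)) for a in TPS if 0 <= a}
facts |= {("P_xp<8000", (a,)) for a in TPS if a < 8000}
facts |= {("P_xp>=8000", (a,)) for a in TPS if a >= 8000}
facts |= {("P_xp<0", (a,)) for a in TPS if a < 0}
facts |= {("P_x1<=xp", (lo, a)) for lo, _, _ in TABLE for a in TPS if lo <= a}
facts |= {("P_xp<x2", (a, hi)) for _, hi, _ in TABLE for a in TPS if a < hi}

rules = [
    ([("P_0<=xp", ("xp",)), ("P_xp<8000", ("xp",))], ("Speed", ("xp",))),
    ([("P_x1<=xp", ("x1", "xp")), ("P_xp<x2", ("xp", "x2")),
      ("Speed", ("xp",)), ("SpeedTable", ("x1", "x2", "y"))],
     ("IgnDeg", ("xp", "y"))),
    ([("IgnDeg", ("xp", "z"))], ("ResArgs", ("xp",))),
    ([("ResArgs", ("xp",))], ("Conj", ("xp",))),
    ([("P_xp>=8000", ("xp",))], ("Conj", ("xp",))),
    ([("P_xp<0", ("xp",))], ("Conj", ("xp",))),
    # Negated conjecture with Goal as its positive literal.
    ([("Conj", (a,)) for a in TPS], ("Goal", ())),
]

derived = saturate(rules, facts)
print(("Goal", ()) in derived)
```

Deriving `Goal` corresponds to the hammered clause set being unsatisfiable, i.e., the universal conjecture holds. Dropping the fact for $D_4$ from `facts` leaves `Conj(6000)` underivable, so `Goal` is no longer derived.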


4 Implementation and Experiments 


We have implemented the sorted Datalog hammer as an extension to the SPASS-SPL
system [11] (option -d) (SSPL in the table). By default the resulting formula is then solved
with the Datalog reasoner VLog. The previously file-based combination with the Datalog
reasoner VLog has been replaced by an integration of VLog into SPASS-SPL via the VLog API.
We focus here only on the sorted extension and refer to [11] for an introduction to the coupling
of the two reasoners. Note that the sorted Datalog hammer itself is not fine-tuned towards
the capabilities of a specific Datalog reasoner, nor is VLog fine-tuned towards the sorted Datalog hammer.

In order to test the progress in efficiency of our sorted hammer, we ran the benchmarks
of the lane change assistant and engine ECU from [11] plus more sophisticated, extended
formalizations. While for the ECU benchmarks in [11] we modeled ignition timing computation
adjusted by inlet temperature measurements, the new benchmarks also take gear box protection
mechanisms into account. The lane change examples in [11] only simulated the supervisor for
lane change assistants over some real-world instances. The new lane change benchmarks check
properties for all potential inputs. The universal ones check that any suggested action by a lane
change assistant is either proven as correct or disproven by our supervisor. The existential ones
check safety properties, e.g., that the supervisor never returns both a proof and a disproof for the
same input. We actually used SPASS-SPL to debug a prototype supervisor for lane change
assistants during its development. The new lane change examples are based on versions generated
during this debugging process where SPASS-SPL found the following bugs: (i) it did not always
return a result, (ii) it declared actions as both safe and unsafe at the same time, and (iii) it declared
actions as safe although they would lead to collisions. The supervisor is now fully verified.
The names of the problems are formatted so the lane change examples start with lc and
the ECU examples start with ecu. Our benchmarks are prototypical for the complexity of
HBS(SLA) reasoning in that they cover all abstract relationships between conjectures and
HBS(SLA) clause sets. With respect to our two case studies we have many more examples
showing respective characteristics. We would have liked to run benchmarks from other sources,
but could not find any problems in the SMT-LIB [5,35] or CHC-COMP [2] benchmarks
within the range of what our hammer can currently accept. Either the arithmetic part goes
beyond SLA or there are further theories involved such as equality on first-order symbols.


For comparison, we also tested several state-of-the-art theorem provers for related logics 
(with the best settings we found): SPASS-SPL-v0.6 (SSPL06 in the table) that uses the original
version of our Datalog Hammer [11] with settings -d for existential and -d -n for universal
conjectures; the satisfiability modulo theories (SMT) solver cvc4-1.8 [4] with settings
--multi-trigger-cache --full-saturate-quant; the SMT solver z3-4.8.12 [28] with its default
settings; the constrained Horn clause (CHC) solver spacer [24] with its default settings; and the
first-order theorem prover vampire-4.5.1 [37] with settings --memory_limit 8000 -p off, 
i.e., with memory extended to 8GB and without proof output. For the SMT/CHC solvers, we 
directly transformed the benchmarks into their respective formats. Vampire gets the same input 
as VLog transformed into the TPTP format [39]. Our experiments with vampire investigate how 
superposition reasoners perform on the hammered benchmarks compared to Datalog reasoners. 


For the experiments, we used the TACAS 22 artifact evaluation VM (Ubuntu 20.04 with 
8 GB RAM and a single processor core) on a system with an Intel Core i7-9700K CPU with 
eight 3.60GHz cores. Each tool got a time limit of 40 minutes for each problem. 


The table in Fig. 2 lists for each benchmark problem: the name of the problem (Problem); 
the type of conjecture (Q), i.e., whether the conjecture is existential ∃ or universal ∀; the status
of the conjecture (Status); number of clauses (|N|); maximum number of variables in a clause 
(vars); the size of the largest test-point set introduced by the sorted/original Hammer (B*/B°); 
the size of the hammered universal conjecture (|Ag|/ IAG for sorted/original); the remaining 
columns list the time needed by the tools to solve the benchmark problems. An entry "N/A" 
means that the benchmark example cannot be expressed in the tool's input format, e.g., it is not
possible to encode a universal conjecture (or, to be more precise, its negation) in the CHC format 
and SPASS-SPL-v0.6 is not sound when the problem contains integer variables. An entry "-" 
means that the tool ran out of time, ran out of memory, exited with an error or returned unknown. 

The experiments show that SPASS-SPL (with the sorted Hammer) is orders of magnitude
faster than SPASS-SPL-v0.6 (with the original Hammer) on problems with universal
conjectures. On problems with existential conjectures, we cannot observe any major performance
gain compared to the original Hammer. Sometimes SPASS-SPL-v0.6 is even slightly faster
(e.g., ecu_e3). Potential explanations are: First, the number of test points has a much larger
impact on universal conjectures because the size of the hammered universal conjecture
increases exponentially with the number of test points. Second, our sorted Hammer needs to
generate more abstracted theory facts than the original Hammer because the latter can reuse
abstraction predicates for theory atoms that are identical up to variable renaming. The sorted
Hammer can reuse the same predicate only if variables also range over the same sets of test 
points, which we have not yet implemented. 

Compared to the other tools, SPASS-SPL is the only one that solves all problems in 
reasonable time. It is also the only solver that can decide in reasonable time whether a universal 
conjecture is not a consequence. This is not surprising because to our knowledge SPASS-SPL 
is the only theorem prover that implements a decision procedure for HBS(SLA). On the 
problems with existential conjectures, our tool-chain solves all of the problems in under 
a minute and with comparable times to the best tool for the problem. The only exceptions
are problems that contain a lot of superfluous clauses, i.e., clauses that are not needed to
confirm/refute the conjecture. The reason might be that VLog derives all facts for the input 
problem in a breadth-first way, which is not very efficient if there are a lot of superfluous 
clauses. Vampire coupled with our sorted Hammer returns the best results for those problems. 
Vampire performed best on the hammered problems among all first-order theorem provers we 
tested, including iProver [25], E [38], and SPASS [40]. We tested all provers in default theorem 
proving mode with adjusted memory limits. The experiments with the first-order provers 
showed that our hammer also works reasonably well for them, but they do not scale well if the
size and the complexity of the universal conjectures increase. For problems with existential
conjectures, the CHC solver spacer is often the best, but as a trade-off it is unable to handle 
universal conjectures. The instantiation techniques employed by cvc4 are good for proving 
some universal conjectures, but both SMT solvers seem to be unable to disprove conjectures. 


5 Conclusion 


We have presented an extension of our previous Datalog hammer [11] supporting a more 
expressive input logic resulting in more elegant and more detailed supervisor formalizations, 
and through a soft typing discipline supporting more efficient reasoning. Our experiments 
show, compared to [11], that our performance on existential conjectures is at the same level 
as SMT and CHC solvers. The complexity of queries we can handle in reasonable time has 
significantly increased, see Section 4, Figure 2. Still, SPASS-SPL is the only solver that can
prove and disprove universal queries. The file interface between SPASS-SPL and VLog has 
been replaced by a close coupling resulting in a more comfortable application. 

Our contribution here solves the third point for future work mentioned in [11] although 
there is still room to also improve our soft typing discipline. In the future, we want SPASS-SPL 
to produce explications that prove that its translations are correct. Another direction is to exploit 
specialized Datalog expressions and techniques, e.g., aggregation and stratified negation, to 
increase the efficiency of our tool-chain and to lift some restrictions from our input formulas. 
Finally, our hammer can be seen as part of an overall reasoning methodology for the class 
of BS(LA) formulas which we presented in [12]. We will implement and further develop this 
methodology and integrate our Datalog hammer.

Abstract. We propose a semi-decision procedure for checking generalized
reachability properties, on generalized Petri nets, that is based on
the Property Directed Reachability (PDR) method. We actually define
three different versions, that vary depending on the method used for
abstracting possible witnesses, and that are able to handle problems of
increasing difficulty. We have implemented our methods in a model-checker
called SMPT and give empirical evidence that our approach can handle
problems that are difficult or impossible to check with current state of
the art tools.


Keywords: Petri nets · Model Checking · Reachability · SMT solving


1 Introduction 


We propose a new semi-decision procedure for checking reachability properties 
on generalized Petri nets, meaning that we impose no constraints on the weights 
of the arcs and do not require a finite state space. We also consider a generalized 
notion of reachability, in the sense that we can check not only the reachability of
a given state, but also whether it is possible to reach a marking that satisfies a combination
of linear constraints between places, such as $(p_0 + p_1 = p_2 + 2) \land (p_1 \leq p_2)$ for
example. Another interesting feature of our approach is that we are able to return
a “certificate of invariance”, in the form of an inductive linear invariant [26],
when we find that a constraint is true on all the reachable markings. To the best 
of our knowledge, there is no other tool able to compute such certificates in the 
general case. 

Our approach is based on an extension of the Property Directed Reachability 
(PDR) method, originally developed for hardware model-checking [8,9], to the 
case of Petri nets. We actually define three variants of our algorithm—two of 
them completely new when compared to our previous work [1]—that vary based 
on the method used for generalizing possible witnesses and can handle problems 
of increasing difficulty. 

Reachability for Petri nets is an important and difficult problem with many
practical applications: obviously for the formal verification of concurrent systems,
but also for the study of diverse types of protocols (such as biological or
business processes); the verification of software systems; the analysis of infinite
state systems; etc. It is also a timely subject, as shown by recent publications on
this subject [7,15], but also by the recent progress made on settling its theoretical
complexity [12,13], which asserts that reachability is Ackermann-complete,
and therefore inherently more complex than, say, the coverability problem. A
practical consequence of this “inherent complexity”, and a general consensus, is
that we should not expect to find a one-size-fits-all algorithm that could be usable
in practice. A better strategy is to try to improve the performance on some
cases (for example by developing new tools, or optimizations, that may perform
better on some examples) or to try to improve “expressiveness” by finding algorithms
that can manage new cases that no other tool can handle.

This wisdom is illustrated by the current state of the art at the Model Checking
Contest (MCC) [3], a competition of model-checkers for Petri nets that includes
an examination for the reachability problem, albeit strongly oriented
towards the analysis of bounded nets. As a matter of fact, the top three tools
in recent competitions (ITS-Tools [30], LOLA [34], and TAPAAL [14]) all rely
on a portfolio approach. Methods that have been proposed in this context include
the use of symbolic techniques, such as k-induction [31]; abstraction refinement [10];
the use of standard optimizations with Petri nets, like stubborn
sets or structural reductions; the use of the “state equation”; reduction to integer
linear programming problems; etc.

The results obtained during the MCC highlight the very good performance
achieved when putting all these techniques together, on bounded nets, with a collection
of randomly generated properties. Another interesting piece of feedback from the
MCC is that simulation techniques are very good at finding a counter-example
when a property is not an invariant [7,31].

In our work, we seek improvements in terms of both performance and expressiveness.
We also target what we consider to be a difficult, and less studied
area of research: procedures that can be applied when a property is an invariant
and when the net is unbounded, or its state space cannot be fully explored. We
also focus on the verification of “genuine” reachability constraints, which are not
instances of a coverability problem. These properties are seldom studied in the
context of unbounded nets. Interestingly enough, our work provides a simple
explanation of why coverability problems are also “simpler” in the case of PDR,
which we will associate with the notion of monotonic formulas.

Concerning performance, we propose a method based on a well-tried symbolic
technique, PDR, that has proved successful with unbounded model-checking
and when used together with SMT solvers [11,22]. Concerning expressiveness,
we define a small benchmark of “difficult nets”: a set of synthetic examples, 
representative of patterns that can make the reachability problem harder. 


Outline and Contributions. We define background material on Petri nets 
in Sect. 2, where we use Linear Integer Arithmetic (LIA) formulas to reason 
about nets. Section 3 describes our decision method, based on PDR and SMT 
solvers, for checking the satisfiability of linear invariants over the reachable states 
of a Petri net. Our method builds sequences of incremental invariants using
both a property that we want to disprove, and a stepwise approximation of
the reachability relation. It also relies on a generalization step where we can 
abstract possible “bad states” into clauses that are propagated in order to find 
a counter-example, or to block inconsistent states. 

We describe a first generalization method, based on the upset of markings, 
that is able to deal with coverability properties. We propose a new, dual variant
based on the concept of hurdles [21], that is without restrictions on the properties.
In this method, the goal is to block bad sequences of transitions instead
of bad states. We show how this approach can be further improved by defining
a notion of saturated transition sequence, at the cost of adding universal
quantification in our SMT problems.

We have implemented our approach in an open-source tool, called SMPT, 
and compare it with other existing tools. In this context, one of our contributions 
is the definition of a set of difficult nets, that characterizes classes of difficult 
reachability problems. 


2 Petri Nets and Linear Reachability Constraints 


Let $\mathbb{N}$ denote the set of natural numbers and $\mathbb{Z}$ the set of integers. Assuming $P$
is a finite, totally ordered set $\{p_1, \dots, p_n\}$, we denote by $\mathbb{N}^P$ the set of mappings
from $P \to \mathbb{N}$ and we overload the addition, subtraction and comparison operators
($=, \geq, \leq$) to act as their component-wise equivalent on mappings. A QF-LIA
formula $F$, with support in $P$, is a Boolean combination of atomic propositions
of the form $\alpha \sim \beta$, where $\sim$ is one of $=$, $\leq$ or $\geq$ and $\alpha, \beta$ are linear expressions,
that is, linear combinations of elements in $\mathbb{N} \cup P$. We simply use the term linear
constraint to describe $F$.

A Petri net $N$ is a tuple $(P, T, \mathrm{pre}, \mathrm{post})$ where $P = \{p_1, \dots, p_n\}$ is a finite
set of places, $T$ is a finite set of transitions (disjoint from $P$), and $\mathrm{pre} : T \to \mathbb{N}^P$
and $\mathrm{post} : T \to \mathbb{N}^P$ are the pre- and post-condition functions (also called the
flow functions of $N$). A state $m$ of a net, also called a marking, is a mapping of
$\mathbb{N}^P$. We say that the marking $m$ assigns $m(p_i)$ tokens to place $p_i$. A marked net
$(N, m_0)$ is a pair composed of a net and an initial marking $m_0$.

A transition $t \in T$ is enabled at marking $m \in \mathbb{N}^P$ when $m \geq \mathrm{pre}(t)$. When
$t$ is enabled at $m$, we can fire it and reach another marking $m' \in \mathbb{N}^P$ such that
$m' = m - \mathrm{pre}(t) + \mathrm{post}(t)$. We denote this transition $m \xrightarrow{t} m'$. The difference
between $m$ and $m'$ is a mapping $\Delta(t) = \mathrm{post}(t) - \mathrm{pre}(t)$ in $\mathbb{Z}^P$, also called the
displacement of $t$.
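
The firing rule can be sketched in a few lines of Python. This is a minimal illustration and not code taken from SMPT: markings are represented as tuples of integers indexed by place, and the two-place transition as well as the names `enabled`, `displacement`, and `fire` are assumptions made for this example.

```python
from typing import Tuple

Marking = Tuple[int, ...]  # one token count per place, in the order p1..pn


def enabled(m: Marking, pre_t: Marking) -> bool:
    """t is enabled at m when m >= pre(t) component-wise."""
    return all(mi >= pi for mi, pi in zip(m, pre_t))


def displacement(pre_t: Marking, post_t: Marking) -> Tuple[int, ...]:
    """Delta(t) = post(t) - pre(t); entries may be negative."""
    return tuple(b - a for a, b in zip(pre_t, post_t))


def fire(m: Marking, pre_t: Marking, post_t: Marking) -> Marking:
    """Return m' = m - pre(t) + post(t), or fail if t is not enabled at m."""
    if not enabled(m, pre_t):
        raise ValueError("transition is not enabled at this marking")
    return tuple(mi - a + b for mi, a, b in zip(m, pre_t, post_t))


# Hypothetical transition: consumes 2 tokens from p1, produces 1 token in p2.
pre_t, post_t = (2, 0), (0, 1)
m_next = fire((3, 0), pre_t, post_t)
```

Note that the displacement alone does not tell us whether a transition can fire: the enabledness check on `pre_t` is what makes reachability harder than solving a linear system.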

By extension, we say that a firing sequence $\sigma = t_1 \dots t_k \in T^*$ can be fired
from $m$, denoted $m \xRightarrow{\sigma} m'$, if there exist markings $m_0, \dots, m_k$ such that $m = m_0$,
$m' = m_k$ and $m_i \xrightarrow{t_{i+1}} m_{i+1}$ for all $i < k$. We can also simply write $m \to^* m'$. In
this case, the displacement of $\sigma$ is the mapping $\Delta(\sigma) = \Delta(t_1) + \dots + \Delta(t_k)$. We
denote by $R(N, m_0)$ the set of markings reachable from $m_0$ in $N$. A marking $m$
is $k$-bounded when each place has at most $k$ tokens. By extension, we say that
a marked Petri net $(N, m_0)$ is bounded when there is $k$ such that all reachable
markings are $k$-bounded.

Fig. 1. Two examples of Petri nets: Parity (left) and PGCD (right).


While reachable states are computed by adding a linear combination of
“displacements” (vectors in $\mathbb{Z}^P$), the set $R(N, m_0)$ is not necessarily semilinear or,
equivalently, definable using Presburger arithmetic [20,26]. This is a consequence
of the constraint that transitions must be enabled before firing. But there is still
some structure to the set $R(N, m_0)$, like for instance the following monotonicity
constraint:


$$\forall m \in \mathbb{N}^P.\quad m_1 \xRightarrow{\sigma} m_2 \ \text{ implies } \ m_1 + m \xRightarrow{\sigma} m_2 + m \tag{H1}$$


We have other such results, such as with the notion of hurdle [21]. Just as
$\mathrm{pre}(t)$ is the smallest marking for which a given transition $t$ is enabled, there is
a smallest marking at which a given firing sequence $\sigma$ is fireable. This marking,
denoted by $H(\sigma)$, has a simple inductive definition:


$$H(t) = \mathrm{pre}(t) \quad\text{and}\quad H(\sigma_1 . \sigma_2) = \max\bigl(H(\sigma_1),\ H(\sigma_2) - \Delta(\sigma_1)\bigr) \tag{H2}$$


Given this notion of hurdles, we obtain that $m \xRightarrow{\sigma} m'$ if and only if (1) the
sequence $\sigma$ is enabled: $m \geq H(\sigma)$, and (2) $m' = m + \Delta(\sigma)$. We use this result in
the second variant of our method.
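
To make the hurdle characterization concrete, the following sketch computes $H(\sigma)$ and $\Delta(\sigma)$ for a short sequence and compares the result with step-by-step firing on a small box of markings. The two-place net, the sequence, and all function names are assumptions chosen for illustration; the check is a bounded sanity test, not a proof.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Vec = Tuple[int, ...]

# Hypothetical net with places (p0, p1): t0 moves a token from p0 to p1,
# t1 consumes two tokens from p1 and puts one token back in p0.
PRE: Dict[str, Vec] = {"t0": (1, 0), "t1": (0, 2)}
POST: Dict[str, Vec] = {"t0": (0, 1), "t1": (1, 0)}


def delta(seq: List[str]) -> Vec:
    """Displacement of a sequence: sum of post(t) - pre(t) over its transitions."""
    d = [0] * len(PRE[seq[0]])
    for t in seq:
        d = [di + b - a for di, a, b in zip(d, PRE[t], POST[t])]
    return tuple(d)


def hurdle(seq: List[str]) -> Vec:
    """H(t) = pre(t) and H(s.t) = max(H(s), pre(t) - Delta(s)), left to right."""
    h = PRE[seq[0]]
    for i in range(1, len(seq)):
        d = delta(seq[:i])
        h = tuple(max(hi, pi - di) for hi, pi, di in zip(h, PRE[seq[i]], d))
    return h


def run(m: Vec, seq: List[str]) -> Optional[Vec]:
    """Fire seq step by step; return the final marking, or None if blocked."""
    for t in seq:
        if any(mi < pi for mi, pi in zip(m, PRE[t])):
            return None
        m = tuple(mi - a + b for mi, a, b in zip(m, PRE[t], POST[t]))
    return m


sigma = ["t0", "t1"]
H, D = hurdle(sigma), delta(sigma)

# Step-by-step firing should succeed exactly when m >= H(sigma),
# and then reach m + Delta(sigma).
agree = all(
    run(m, sigma) == (tuple(a + b for a, b in zip(m, D))
                      if all(a >= b for a, b in zip(m, H)) else None)
    for m in product(range(4), repeat=2)
)
```

The practical benefit is that a whole firing sequence can be summarized by two vectors, so enabledness of the sequence becomes a single component-wise comparison.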

We can go a step further and characterize a necessary and sufficient condition
for firing the sequence $\sigma . \sigma^k$, meaning firing the same sequence more than once.
Given $\Delta(\sigma)$, a place $p$ with a negative displacement (say $-d$) means that we
“lose” $d$ tokens each time we fire $\sigma$. Hence we should budget $d$ tokens in $p$
for each new iteration. Therefore we have $m \xRightarrow{\sigma^{k+1}} m'$ if and only if
(1) $m \geq H(\sigma) + k \cdot \max(0, -\Delta(\sigma))$, and (2) $m' = m + (k+1) \cdot \Delta(\sigma)$. Equivalently, if
we denote by $m^+$ the “positive” part of mapping $m$, such that $m^+(p) = 0$ when
$m(p) \leq 0$ and $m^+(p) = m(p)$ otherwise, we have:


$$H(\sigma^{k+1}) = \max\bigl(H(\sigma),\ H(\sigma) - k \cdot \Delta(\sigma)\bigr) = H(\sigma) + k \cdot (-\Delta(\sigma))^+ \tag{H3}$$
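
The closed form (H3) can be cross-checked against the inductive definition (H2) by unrolling the sequence. The sketch below does this for a hypothetical two-place net; the net, the helper `hurdle_and_delta`, and the range of $k$ tested are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Vec = Tuple[int, ...]

# Hypothetical net with places (p0, p1), used only for illustration.
PRE: Dict[str, Vec] = {"t0": (1, 0), "t1": (0, 2)}
POST: Dict[str, Vec] = {"t0": (0, 1), "t1": (1, 0)}


def hurdle_and_delta(seq: List[str]) -> Tuple[Vec, Vec]:
    """Compute H(seq) and Delta(seq) in one left-to-right pass, following (H2)."""
    n = len(PRE[seq[0]])
    h, d = (0,) * n, (0,) * n
    for t in seq:
        # H(s.t) = max(H(s), pre(t) - Delta(s))
        h = tuple(max(hi, a - di) for hi, a, di in zip(h, PRE[t], d))
        d = tuple(di + b - a for di, a, b in zip(d, PRE[t], POST[t]))
    return h, d


def iterated_hurdle(seq: List[str], k: int) -> Vec:
    """Closed form (H3): H(seq^(k+1)) = H(seq) + k * (-Delta(seq))^+."""
    h, d = hurdle_and_delta(seq)
    return tuple(hi + k * max(0, -di) for hi, di in zip(h, d))


sigma = ["t0", "t1"]
closed_form = [iterated_hurdle(sigma, k) for k in range(4)]
unrolled = [hurdle_and_delta(sigma * (k + 1))[0] for k in range(4)]
```

Because the closed form is linear in $k$, it can be used inside a linear formula with $k$ as an extra variable, which is not possible with the unrolled definition.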


Examples. We give two simple examples of unbounded nets in Fig. 1, which
are both part of our benchmark. Parity has a single place, hence its state space
can be interpreted as a subset of $\mathbb{N}$: with an initial marking of 1, this is exactly
the set of odd numbers (and therefore state 0 is not reachable). We are in a
special case where the set $R(N, m_0)$ is semilinear. For instance, it can be seen
as the solution to the constraint $\exists k . (p = 2k + 1)$, or equivalently $p \equiv 1 \pmod 2$.
But it cannot be expressed with a linear constraint involving only the variable
$p$ without quantification or modulo arithmetic. This example can be handled by
most of the tools used in our experiments, e.g. with the help of k-induction.

In PGCD, transitions $t_0$/$t_1$ can decrement/increment the marking of $p_0$ by 1.
Nonetheless, with this initial state, it is the case that the number of occurrences
of $t_0$ is always less than that of $t_1$ in any feasible sequence $\sigma$. Hence the two
predicates $p_0 \geq 2$ and $p_2 \geq p_1$ are valid invariants. (Since some tools do not
accept literals of the form $p \geq q$, we added the “redundant” place $p_3$ so we can
restate our second invariant as $p_3 \geq 1$.) These invariants cannot be proved by
reasoning only on the displacements of traces (using the state equation) and are
already out of reach for LOLA or TAPAAL.


Linear Reachability Formulas. We can revisit the semantics of Petri nets
using linear predicates. In the following, we use $\mathbf{p}$ for the vector $(p_1, \dots, p_n)$,
and $F(\mathbf{p})$ for a formula with variables in $P$. We also simply use $F(\boldsymbol{\alpha})$ for the
substitution $F\{p_1 \leftarrow \alpha_1\} \dots \{p_n \leftarrow \alpha_n\}$, with $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_n)$ a sequence of
linear expressions. We say that a mapping $m$ of $\mathbb{N}^P$ is a model of $F$, denoted
$m \models F$, if the ground formula $F(m) = F(m(p_1), \dots, m(p_n))$ is true. Hence
we can also interpret $F$ as a predicate over markings. Finally, we define the
semantics of $F$ as the set $[\![F]\!] = \{m \in \mathbb{N}^P \mid m \models F\}$.

As usual, we say that a predicate $F$ is valid, denoted $\models F$, when all its
interpretations are true ($[\![F]\!] = \mathbb{N}^P$); and that $F$ is unsatisfiable (or simply
unsat), denoted $\not\models F$, when $[\![F]\!] = \emptyset$.

We can define many properties on the markings of a net $N$ using this framework.
For instance, we can model the set of markings $m$ such that some transition
$t$ is enabled using predicate $\mathrm{ENBL}_t$ (see Equation (2) below). We can also define
a linear predicate to describe the relation between the markings before and after
some transition $t$ fires. To this end, we use a vector $\mathbf{p}'$ of “primed variables”
$(p'_1, \dots, p'_n)$, where $p'_i$ will stand for the marking of place $p_i$ after a transition
is fired. With this convention, formula $\mathrm{FIRE}_t(\mathbf{p}, \mathbf{p}')$ is such that $\mathrm{FIRE}_t(m, m')$
entails $m \xrightarrow{t} m'$ or $m = m'$ when $t$ is enabled at $m$. With all these notations,
we can define a predicate $T(\mathbf{p}, \mathbf{p}')$ that “encodes” the effect of firing at most one
transition in the net $N$.


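$$\begin{aligned}
\mathrm{GEQ}_m(\mathbf{p}) &\;\overset{\mathrm{def}}{=}\; \bigwedge_{i \in 1..n} \bigl(p_i \geq m(p_i)\bigr) && (1)\\
\mathrm{ENBL}_t(\mathbf{p}) &\;\overset{\mathrm{def}}{=}\; \bigwedge_{i \in 1..n} \bigl(p_i \geq \mathrm{pre}(t)(p_i)\bigr) \;=\; \mathrm{GEQ}_{H(t)}(\mathbf{p}) && (2)\\
\Delta_t(\mathbf{p}, \mathbf{p}') &\;\overset{\mathrm{def}}{=}\; \bigwedge_{i \in 1..n} \bigl(p'_i = p_i + \mathrm{post}(t)(p_i) - \mathrm{pre}(t)(p_i)\bigr) && (3)\\
\mathrm{EQ}(\mathbf{p}, \mathbf{p}') &\;\overset{\mathrm{def}}{=}\; \bigwedge_{i \in 1..n} \bigl(p_i = p'_i\bigr) && (4)\\
\mathrm{FIRE}_t(\mathbf{p}, \mathbf{p}') &\;\overset{\mathrm{def}}{=}\; \mathrm{EQ}(\mathbf{p}, \mathbf{p}') \lor \bigl(\mathrm{ENBL}_t(\mathbf{p}) \land \Delta_t(\mathbf{p}, \mathbf{p}')\bigr) && (5)\\
T(\mathbf{p}, \mathbf{p}') &\;\overset{\mathrm{def}}{=}\; \mathrm{EQ}(\mathbf{p}, \mathbf{p}') \lor \bigvee_{t \in T} \bigl(\mathrm{ENBL}_t(\mathbf{p}) \land \Delta_t(\mathbf{p}, \mathbf{p}')\bigr) && (6)
\end{aligned}$$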
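
As an informal reading aid, Equations (1) to (6) can be evaluated on concrete markings with plain Python functions. This sketch does not build SMT formulas; it only mirrors the logical structure. The two-place net and the function names `geq`, `enbl`, `delta_rel`, `fire_rel`, and `trans` are assumptions introduced for the example.

```python
from typing import Dict, Tuple

Vec = Tuple[int, ...]

# Hypothetical two-place net (an assumption for illustration).
PRE: Dict[str, Vec] = {"t0": (1, 0), "t1": (0, 2)}
POST: Dict[str, Vec] = {"t0": (0, 1), "t1": (1, 0)}


def geq(m: Vec, p: Vec) -> bool:
    """GEQ_m(p): every place holds at least m(p_i) tokens (Equation 1)."""
    return all(pi >= mi for pi, mi in zip(p, m))


def enbl(t: str, p: Vec) -> bool:
    """ENBL_t(p) = GEQ_{pre(t)}(p) (Equation 2)."""
    return geq(PRE[t], p)


def delta_rel(t: str, p: Vec, q: Vec) -> bool:
    """Delta_t(p, p'): p' = p + post(t) - pre(t) (Equation 3)."""
    return all(qi == pi + b - a for pi, qi, a, b in zip(p, q, PRE[t], POST[t]))


def fire_rel(t: str, p: Vec, q: Vec) -> bool:
    """FIRE_t(p, p') (Equation 5); EQ (Equation 4) is plain tuple equality."""
    return p == q or (enbl(t, p) and delta_rel(t, p, q))


def trans(p: Vec, q: Vec) -> bool:
    """T(p, p'): stutter, or fire exactly one enabled transition (Equation 6)."""
    return p == q or any(enbl(t, p) and delta_rel(t, p, q) for t in PRE)
```

The stuttering disjunct `p == q` is what lets $T$ describe "at most one" step, so a formula such as $F(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}')$ also covers the case where nothing fires.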


In our work, we focus on the verification of safety properties on the reachable
markings of a marked net $(N, m_0)$. Examples of properties that we want to check
include: checking if some transition $t$ is enabled (commonly known as quasi-liveness);
checking if there is a deadlock; checking whether some linear invariant
between place markings is true; and so on. These are all properties that can be
expressed using a linear predicate.


Definition 1 (Linear Invariants and Inductive Predicates). 

A linear predicate $F$ is an invariant on $(N, m_0)$ if and only if we have $m \models F$
for all $m \in R(N, m_0)$. It is inductive if for all markings $m$ we have $m \models F$ and
$m \to m'$ entails $m' \models F$.


It is possible to characterize inductive predicates using our logical framework.
Indeed, $F$ is inductive if and only if the QF-LIA formula (i) $F(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}') \land \neg F(\mathbf{p}')$
is unsat. Also, an inductive formula is an invariant when (ii) $m_0 \models F$, or
equivalently $\models F(m_0)$. As a consequence, a sufficient condition for a predicate
$F$ to be invariant is to have both conditions (i) and (ii); conditions that can
be checked using an SMT solver. Unfortunately, the predicates that we need to
check are often not inductive. In this case, the next best thing is to try to build
an inductive invariant, say $R$, such that $[\![R]\!] \subseteq [\![F]\!]$ (or equivalently $R \land \neg F$
unsat). This predicate provides a certificate of invariance that can be checked
independently.


Lemma 1 (Certificate of Invariance). A sufficient condition for $F$ to be
invariant on $(N, m_0)$ is to exhibit a linear predicate $R$ that is (i) initial: $R(m_0)$
valid; (ii) inductive: $R(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}') \land \neg R(\mathbf{p}')$ unsat; and (iii) that entails $F$,
for instance: $R \land \neg F$ unsat.
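
The three conditions of Lemma 1 can be illustrated with a brute-force check over a finite box of markings. This is only a bounded sanity test under stated assumptions, whereas the actual conditions are unsatisfiability queries that would be sent to an SMT solver over all markings. The net, the predicates `F` and `R`, and the bound are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Vec = Tuple[int, ...]
Pred = Callable[[Vec], bool]

# Hypothetical net: t0 moves a token from p0 to p1, t1 moves it back.
PRE: Dict[str, Vec] = {"t0": (1, 0), "t1": (0, 1)}
POST: Dict[str, Vec] = {"t0": (0, 1), "t1": (1, 0)}
M0: Vec = (1, 0)


def successors(m: Vec) -> List[Vec]:
    """All markings reachable from m by firing one enabled transition."""
    return [
        tuple(mi - a + b for mi, a, b in zip(m, PRE[t], POST[t]))
        for t in PRE
        if all(mi >= a for mi, a in zip(m, PRE[t]))
    ]


def inductive_on_box(pred: Pred, bound: int) -> bool:
    """Condition (ii), restricted to markings with at most `bound` tokens per place."""
    box = product(range(bound + 1), repeat=len(M0))
    return all(pred(m2) for m in box if pred(m) for m2 in successors(m))


def certifies_on_box(cert: Pred, prop: Pred, bound: int) -> bool:
    """Conditions (i), (ii) and (iii) of Lemma 1, checked on a finite box only."""
    box = list(product(range(bound + 1), repeat=len(M0)))
    initial = cert(M0)
    entails = all(prop(m) for m in box if cert(m))
    return initial and inductive_on_box(cert, bound) and entails


F: Pred = lambda m: m[0] <= 1           # property we would like to prove
R: Pred = lambda m: m[0] + m[1] == 1    # candidate certificate (token conservation)
```

In this example `F` is an invariant but is not inductive on its own (from the unreachable marking `(1, 1)`, firing `t1` leads to `(2, 0)`), while the stronger predicate `R` passes all three checks.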


This result is in line with a property proved by Leroux [26], which states 
that when a final configuration m is not reachable there must exist a Presburger 
inductive invariant that contains mo but does not contain m. This result does 
not explain how to effectively compute such an invariant. Moreover, in our case, 
we provide a method that works with general linear predicates, and not only 
with single configurations. On the other side of the coin, given the known results 
about the complexity of the problem, we do not expect our procedure to be 
complete in the general case. 

In the next section, we show how to (potentially) find such certificates using
an adaptation of the PDR method. An essential component of PDR is to abstract
a “scenario” leading to the model of some property $F$ (say a transition $m \xrightarrow{t} m'$
with $m' \models F$) into a predicate that contains $m$ (and potentially many more
similar scenarios). More generally, a generalization of the trio $(m, \sigma, F)$ is a
predicate $G$ satisfied by $m$ such that $m_1 \models G$ entails that there is $m_1 \to^* m_2$
with $m_2 \models F$.

We can use properties (H1)—(H3), defined earlier, to build generalizations. 


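Lemma 2 (Generalization). Assume we have a scenario such that $m \xRightarrow{\sigma} m'$
and $m' \models F$. We have three possible generalizations of the trio $(m, \sigma, F)$.

(G1) If property $F$ is monotonic, then $m_1 \models \mathrm{GEQ}_m(\mathbf{p})$ implies there is $m_2 \geq m'$
such that $m_1 \xRightarrow{\sigma} m_2$ and $m_2 \models F$.
(G2) If $m_1 \models \mathrm{GEQ}_{H(\sigma)}(\mathbf{p}) \land F(\mathbf{p} + \Delta(\sigma))$ then $m_1 \xRightarrow{\sigma} m_2$ and $m_2 \models F$.
(G3) Assume $a, b$ are mappings of $\mathbb{N}^P$ such that $a = H(\sigma)$ and $b = (-\Delta(\sigma))^+$,
with the notations used in (H3). Then

$$m_1 \models \exists k . \Bigl( \bigl[\textstyle\bigwedge_{i \in 1..n} (p_i \geq a(i) + k \cdot b(i))\bigr] \land F(\mathbf{p} + (k+1) \cdot \Delta(\sigma)) \Bigr) \quad\text{implies}\quad \exists k . \; m_1 \xRightarrow{\sigma^{k+1}} m_2 \ \text{and}\ m_2 \models F$$

Proof. Each property is a direct result of properties (H1) to (H3). □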


Property (G3) is the first and only instance of a linear formula using an extra
variable, $k$, that is not in $P$. The result is still a linear formula though, since we
never need to use the product of two variables. This generalization is used when
we want to “saturate the sequence $\sigma$”. This is the only situation where we may
need to deal with quantified LIA formulas. Another solution would be to replace
each quantification with the use of modulo arithmetic, but this operation may
be costly and could greatly increase the size of our formulas. It would also not
cut down the complexity of the SMT problems.
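
Generalization (G2) can be sketched as a function that turns a sequence $\sigma$ and a feared predicate into a predicate over markings. The code below evaluates predicates on concrete markings rather than building SMT formulas; the net, the feared cube, and the names `hurdle_and_delta` and `generalize_g2` are assumptions for this illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Vec = Tuple[int, ...]
Pred = Callable[[Vec], bool]

# Hypothetical net with places (p0, p1).
PRE: Dict[str, Vec] = {"t0": (1, 0), "t1": (0, 2)}
POST: Dict[str, Vec] = {"t0": (0, 1), "t1": (1, 0)}


def step(m: Vec, t: str) -> Vec:
    """Add the displacement of t to m (no enabledness check)."""
    return tuple(mi - a + b for mi, a, b in zip(m, PRE[t], POST[t]))


def run(m: Vec, seq: List[str]) -> Optional[Vec]:
    """Fire seq step by step; return the final marking, or None if blocked."""
    for t in seq:
        if any(mi < a for mi, a in zip(m, PRE[t])):
            return None
        m = step(m, t)
    return m


def hurdle_and_delta(seq: List[str]) -> Tuple[Vec, Vec]:
    """Compute H(seq) and Delta(seq) in one pass, following (H2)."""
    n = len(PRE[seq[0]])
    h, d = (0,) * n, (0,) * n
    for t in seq:
        h = tuple(max(hi, a - di) for hi, a, di in zip(h, PRE[t], d))
        d = step(d, t)
    return h, d


def generalize_g2(seq: List[str], feared: Pred) -> Pred:
    """G(p) = GEQ_{H(seq)}(p) and feared(p + Delta(seq)), as in (G2)."""
    h, d = hurdle_and_delta(seq)

    def g(m: Vec) -> bool:
        above_hurdle = all(mi >= hi for mi, hi in zip(m, h))
        return above_hurdle and feared(tuple(mi + di for mi, di in zip(m, d)))

    return g


sigma = ["t0", "t1"]
feared: Pred = lambda m: m[1] == 0  # hypothetical non-monotonic cube: p1 is empty
g = generalize_g2(sigma, feared)
blocked = [m for m in product(range(4), repeat=2) if g(m)]
```

Here the single scenario starting at `(1, 1)` generalizes to every marking with `p1 = 1` and `p0 >= 1`, an infinite family of which the box shows only the first few.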


3 Property Directed Reachability 


Some symbolic model-checking procedures, such as BMC [6] or k-induction [28],
are a good fit when we try to find counter-examples on infinite-state systems.
Unfortunately, they may perform poorly when we want to check an invariant.
In this case, adaptations of the PDR method [8,9] (also known as IC3, for
“Incremental Construction of Inductive Clauses for Indubitable Correctness”) have
proved successful.

We assume that we start with an initial state $m_0$ satisfying a linear property,
$\mathbb{I}$, and that we want to prove that property $\mathbb{P}$ is an invariant of the marked net
$(N, m_0)$. (We use blackboard bold symbols to distinguish between parameters
of the problem, and formulas that we build for solving it.) We define $\mathbb{F} = \neg\mathbb{P}$
as the “set of feared events”; such that $\mathbb{P}$ is not an invariant if we can find $m$
in $R(N, m_0)$ such that $m \models \mathbb{F}$. To simplify the presentation, we assume that $\mathbb{F}$
is a conjunction of literals (a cube), meaning that $\mathbb{P}$ is a clause. In practice, we
assume that $\mathbb{F}$ is in Disjunctive Normal Form.

PDR is a combination of induction, over-approximation, and SAT or SMT
solving. The goal is to build an incremental sequence of predicates $F_0, \dots, F_k$
that are “inductive relative to stepwise approximations”: such that $m \models F_i$ and
$m \to m'$ entails $m' \models F_{i+1}$, but not $m' \models \mathbb{F}$. The method stops when it finds a
counter-example, or when we find that one of the predicates $F_i$ is inductive.

We adapt the PDR approach to Petri nets, using linear predicates and SMT 
solvers for the QF-LIA and LIA logics in order to learn, generalize, and propagate 
new clauses. The most innovative part of our approach is the use of specific 
“generalization algorithms” that take advantage of the Petri nets theory, like the 
use of hurdles for example. Our implementation follows closely the algorithm for 
IC3 described in [9] and, for the sake of brevity, we only give the pseudo-code 
for the four main functions.

Function prove(𝕀, 𝔽 : linear predicates)
Result: ⊥ if 𝔽 is reachable (ℙ = ¬𝔽 is not an invariant), otherwise ⊤

 1  if sat(𝕀(p) ∧ T(p, p') ∧ 𝔽(p')) then
 2      return ⊥
 3  k ← 1, F_0 ← 𝕀, F_1 ← ℙ
 4  while ⊤ do
 5      if not strengthen(k) then
 6          return ⊥
 7      propagateClauses(k)
 8      if CL(F_i) = CL(F_{i+1}) for some 1 ≤ i < k then
 9          return ⊤
10      k ← k + 1

The main function, prove, computes an Over Approximated Reachability Sequence
(OARS) $(F_0, \dots, F_k)$ of linear predicates, called frames, with variables
in $\mathbf{p}$. An OARS meets the following constraints: (1) it is monotonic: $F_i \land \neg F_{i+1}$
unsat for $0 \leq i < k$; (2) it contains the initial states: $\mathbb{I} \land \neg F_0$ unsat; (3) it
does not contain feared states: $F_i \land \mathbb{F}$ unsat for $0 \leq i \leq k$; and (4) it satisfies
consecution: $F_i(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}') \land \neg F_{i+1}(\mathbf{p}')$ unsat for $0 \leq i < k$.

By construction, each frame $F_i$ in the OARS is defined as a set of clauses,
$\mathrm{CL}(F_i)$, meaning that $F_i$ is built as a formula in CNF: $F_i = \bigwedge_{cl \in \mathrm{CL}(F_i)} cl$. We
also enforce that $\mathrm{CL}(F_{i+1}) \subseteq \mathrm{CL}(F_i)$ for $0 \leq i < k$, which means that the
monotonicity property between frames is trivially ensured.
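
The clause-set view of frames makes two checks purely syntactic: monotonicity between frames and the detection of a fixpoint. The sketch below illustrates both, with clauses treated as opaque labels. The representation, the sample frames, and the function names are assumptions; a real implementation would store linear clauses and use an SMT solver for the semantic conditions.

```python
from typing import FrozenSet, List, Optional

Frame = FrozenSet[str]  # CL(F_i): the set of clauses of frame F_i, as opaque labels


def is_syntactically_monotonic(frames: List[Frame]) -> bool:
    """Check that CL(F_{i+1}) is a subset of CL(F_i) for 0 <= i < k."""
    return all(frames[i + 1] <= frames[i] for i in range(len(frames) - 1))


def fixpoint_index(frames: List[Frame]) -> Optional[int]:
    """Return the smallest i with 1 <= i < k and CL(F_i) = CL(F_{i+1}), if any."""
    for i in range(1, len(frames) - 1):
        if frames[i] == frames[i + 1]:
            return i
    return None


# Hypothetical OARS with k = 3: each later frame keeps a subset of the clauses.
frames: List[Frame] = [
    frozenset({"init", "c1", "c2", "P"}),
    frozenset({"c1", "c2", "P"}),
    frozenset({"c2", "P"}),
    frozenset({"c2", "P"}),
]
```

Since a frame with fewer clauses denotes a larger set of markings, the subset relation on clause sets gives the implication $F_i \Rightarrow F_{i+1}$ without any solver call.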

The body of function prove contains a main iteration (line 4) that increases
the value of $k$ (the number of levels of the OARS). At each step, we enter a
second, minor iteration (line 2 in function strengthen), where we generate new
minimal inductive clauses that will be propagated to all the frames. Hence both
the length of the OARS, and the set of clauses in its frames, increase during
computation. The procedure stops when we find an index $i$ such that $F_i = F_{i+1}$.
In this case we know that $F_i$ is an inductive invariant satisfying $\mathbb{P}$. We can also
stop during the iteration if we find a counter-example (a model $m$ of $\mathbb{F}$). In this
case, we can also return a trace leading to $m$.

When we start the first minor iteration, we have $k = 1$, $F_0 = \mathbb{I}$ and $F_1 = \mathbb{P}$.
If we have $F_1(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}') \land \mathbb{F}(\mathbf{p}')$ unsat, it means that $\mathbb{P}$ is inductive, so we can
stop and return that $\mathbb{P}$ is an invariant. Otherwise, we proceed with the strengthen
phase, where each model of $F_1(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}') \land \mathbb{F}(\mathbf{p}')$ becomes a potential
counter-example, or witness, that we need to “block” (lines 3-5 of function strengthen).

Instead of blocking only one witness, we first generalize it into a predicate 
that abstracts similar dangerous states (see the call to generalizeWitness). 
This is done by applying one of the three generalization results in Lemma 2. We 
give more details about this step later. By construction, each generalization is a 
cube $s$ (a conjunction of literals). Hence, when we block it, we learn new clauses
from $\neg s$ that can be propagated to the previous frames.

Function strengthen(k : current level)

1  try:
2      while (m →[t] m') ⊨ F_k(p) ∧ T(p, p') ∧ 𝔽(p') do
3          s ← generalizeWitness(m, t, 𝔽)
4          n ← inductivelyGeneralize(s, k − 2, k)
5          pushGeneralization({(s, n + 1)}, k)
6      return ⊤
7  catch counter example:
8      return ⊥

Function inductivelyGeneralize(s : cube, min : level, k : level)

1  if min < 0 and sat(F_0(p) ∧ T(p, p') ∧ s(p')) then
2      raise Counterexample
3  for i ← max(1, min + 1) to k do
4      if sat(F_i(p) ∧ T(p, p') ∧ ¬s(p) ∧ s(p')) then
5          generateClause(s, i − 1, k)
6          return i − 1
7  generateClause(s, k, k)
8  return k


Before pushing a new clause, we test whether $s$ is reachable from previous
frames. We take advantage of this opportunity to find if we have a counter-example
and, if not, to learn new clauses in the process. This is the role of
functions pushGeneralization and inductivelyGeneralize.


We find a counter-example (in the call to inductivelyGeneralize) if the
generalization from a witness found at level $k$, say $s$, reaches level 0 and
$F_0(\mathbf{p}) \land T(\mathbf{p}, \mathbf{p}') \land s(\mathbf{p}')$ is satisfiable (line 1 in inductivelyGeneralize). Indeed, it
means that we can build a trace from $\mathbb{I}$ to $\mathbb{F}$ by going through $F_1, \dots, F_k$.


The method relies heavily on checking the satisfiability of linear formulas in
QF-LIA, which is achieved with a call to an SMT solver. In each function call, we
need to test if predicates of the form $F_i \land T \land G$ are unsat and, if not, enumerate
their models. To accelerate the strengthening of frames, we also rely on the unsat
core of properties in order to compute a minimal inductive clause (MIC).


Our approach is parametrized by a generalization function (generalizeWitness)
that is crucial if we want to avoid enumerating a large, potentially unbounded,
set of witnesses. This can be the case, for example, in line 5 of pushGeneralization.
In this particular case, we find a state $m$ at level $n$ (because
$m \models F_n$), and a transition $t$ that leads to a problematic clause in $F_{n+1}$. Therefore
we have a sequence $\sigma$ of size $k - n + 1$ such that $m \xRightarrow{\sigma} m'$ and $m' \models \mathbb{F}$. We
consider three possible methods for generalizing the trio $(m, \sigma, \mathbb{F})$, which correspond
to properties (G1)-(G3) in Lemma 2.

Function pushGeneralization(states : set of (state, level), k : level)

 1  while ⊤ do
 2      (s, n) ← from states minimizing n
 3      if n > k then
 4          return
 5      if (m →[t] m') ⊨ F_n(p) ∧ T(p, p') ∧ s(p') then
 6          p ← generalizeWitness(m, t, s)
 7          l ← inductivelyGeneralize(p, n − 2, k)
 8          states ← states ∪ {(p, l + 1)}
 9      else
10          l ← inductivelyGeneralize(s, n, k)
11          states ← states \ {(s, n)} ∪ {(s, l + 1)}


State-based Generalization. A special case of the reachability problem is
when the predicate F is monotonic, meaning that $m_1 \models F$ entails
$m_1 + m_2 \models F$ for all markings $m_1, m_2$. A sufficient (syntactic)
condition is for F to be a positive formula with literals of the form
$\sum_{i \in I} p_i \geq a$. This class of predicates coincides with what is
called a coverability property, for which there exist specialized verification
methods (see e.g. [18,19]).
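
As an illustration of the monotonicity condition, the following sketch checks, by brute force over small markings, that a positive formula built from literals of the form "sum of places ≥ a" is preserved when tokens are added. The encoding of formulas as lists of cubes and the helper names `holds` and `is_monotonic_up_to` are our own assumptions for this example; they are not part of SMPT.

```python
from itertools import product

# A positive formula is modelled as a disjunction (outer list) of cubes (inner
# lists); a literal (places, a) stands for  sum(marking[i] for i in places) >= a.
def holds(formula, marking):
    return any(
        all(sum(marking[i] for i in places) >= a for places, a in cube)
        for cube in formula
    )

def is_monotonic_up_to(formula, n_places, bound):
    """Brute-force check of  m1 |= F  =>  m1 + m2 |= F  for markings below bound."""
    markings = list(product(range(bound), repeat=n_places))
    for m1 in markings:
        if not holds(formula, m1):
            continue
        for m2 in markings:
            total = tuple(a + b for a, b in zip(m1, m2))
            if not holds(formula, total):
                return False
    return True

# Hypothetical predicate F = (p0 + p1 >= 2) or (p2 >= 1) over three places.
F = [[((0, 1), 2)], [((2,), 1)]]
print(is_monotonic_up_to(F, n_places=3, bound=3))
```

The check is only exhaustive up to the chosen bound; the syntactic condition in the text is what guarantees monotonicity for all markings.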


By property (G1), if we have to block a witness m such that
$m \xrightarrow{\sigma} m'$ and $m' \models F$, we can as well block all the
states greater than m. Hence we can choose the predicate $\mathrm{GEQ}_m$ to
generalize m. This is a very convenient case for verification and one of the
optimizations used in previous works on PDR for Petri nets [1,16,23,24]. First,
the generalization is very simple and we can easily compute a MIC when we block
predicate $\mathrm{GEQ}_m$ in a frame. Also, we can prove the completeness of
the procedure when F is monotonic. An intuition is that it is enough, in this
case, to check the property on the minimal coverability set of the net, which
is always finite [18]. The procedure is also complete for finite transition
systems. These are the only cases where we have been able to prove that our
method always terminates.


Transition-based Generalization. We propose a new generalization based 
on the notion of hurdles. This approach can be used when F is not monotonic, 
for example when we want to check an invariant that contains literals of the 
form p = k (e.g. the reachability of a fixed marking) or p > q. 

Assume we need to block a witness of the form
$m \xrightarrow{\sigma} m' \models s$. Typically, s is a cube in F, or a state
resulting from a call to pushGeneralization. By property (G2), we can as well
block all the states satisfying
$G_\sigma(p) = \mathrm{GEQ}_{H(\sigma)}(p) \land s(p + \Delta(\sigma))$. This
generalization is interesting when property s does not constrain all the
places, or when we have few equality constraints. In this case $G_\sigma$ may
have an infinite number of models. It should be noted that using the duality
between “feasible traces” and hurdles is not new. For example, it was used
recently [19] to accelerate the computation of coverability trees. Nonetheless,
to the best of our knowledge, this is the first time that this generalization
method has been used with PDR.
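
To make the predicate $G_\sigma$ concrete, here is a minimal sketch that computes a hurdle and a displacement for a firing sequence and builds the corresponding generalization as a Python predicate. It assumes the usual composition rule $H(\sigma_1 \cdot \sigma_2) = \max(H(\sigma_1), H(\sigma_2) - \Delta(\sigma_1))$; the representation of transitions as (pre, post) tuples and all function names are illustrative choices, not SMPT code.

```python
def delta(pre, post):
    """Displacement of a single transition: post - pre, place by place."""
    return tuple(b - a for a, b in zip(pre, post))

def hurdle_and_delta(sequence):
    """Fold a sequence of (pre, post) pairs into (H(sigma), Delta(sigma)).

    Assumed rule: H(s1.s2) = max(H(s1), H(s2) - Delta(s1)), place by place.
    """
    n = len(sequence[0][0])
    h, d = (0,) * n, (0,) * n
    for pre, post in sequence:
        h = tuple(max(hi, pi - di) for hi, pi, di in zip(h, pre, d))
        d = tuple(di + x for di, x in zip(d, delta(pre, post)))
    return h, d

def generalize(sequence, s):
    """Return G_sigma as a predicate: GEQ_H(p) and s(p + Delta)."""
    h, d = hurdle_and_delta(sequence)
    def g(p):
        covers = all(pi >= hi for pi, hi in zip(p, h))
        return covers and s(tuple(pi + di for pi, di in zip(p, d)))
    return g

# Transition t2 of net Parity, as used in the text: it removes 2 tokens from p.
t2 = ((2,), (0,))
print(hurdle_and_delta([t2]))        # hurdle 2, displacement -2
g = generalize([t2], lambda m: m[0] == 0)
print(g((2,)), g((4,)))
```

With `s` set to the cube p = 0, the resulting predicate only accepts the marking p = 2, which matches the observation that this generalization blocks no new markings on Parity.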


Saturated Transition-based Generalization. We still assume that we start
from a witness $m \xrightarrow{\sigma} m' \models s$. Our last method relies on
property (G3) and allows us to consider several iterations of $\sigma$. If we
fix the value of k, then a possible generalization is
$G^k_\sigma \triangleq \big(\bigwedge_i (p_i \geq a(i) + k \cdot b(i))\big) \land s(p + (k+1) \cdot \Delta(\sigma))$,
where a, b are the mappings of $\mathbb{N}^P$ defined in Lemma 2. (Notice that
$G^0_\sigma = G_\sigma$.) More generally the predicate
$G^{\leq k}_\sigma \triangleq G^0_\sigma \lor \dots \lor G^k_\sigma$ is a valid
generalization for the witness $(m, \sigma, s)$, in the sense that if
$m_1 \models G^{\leq k}_\sigma$ then there is a trace $m_1 \rightarrow^* m_2$
such that $m_2 \models s$. At the cost of using existential quantification (and
therefore a “top-level” universal quantification when we negate the predicate
to block it in a frame), we can use the more general predicate
$G^*_\sigma \triangleq \exists k . G^k_\sigma$, which is still linear and has
its support in P.

We know examples of invariants where the PDR method does not terminate
except when using saturation. A simple example is the net Parity, used as an
example in Sect. 2, with the invariant $P = (p \geq 1)$. In this case,
$F = \neg P = (p = 0)$. Hence we are looking for witnesses such that
$m \rightarrow^* 0$. The simplest example is $2 \xrightarrow{t_2} 0$, which
corresponds to the “blocking clause” $p \neq 2$. In this case, we have
$H(t_2) = 2$ and $\Delta(t_2) = -2$. Hence the transition-based generalization
is $(p \geq 2) \land (p - 2 = 0) \equiv (p = 2)$, which does not block new
markings. At this point, we try to block $(p = 0) \lor (p = 2)$. The following
minor iteration of our method will consider the witness
$4 \xrightarrow{t_2 \cdot t_2} 0$, etc. Hence after k minor iterations, we have
$F_k = (p \neq 0) \land (p \neq 2) \land \dots \land (p \neq 2k)$. If we
saturate $t_2$, we find in one step that we should block
$\exists k . (p - 2 \cdot (k + 1) = 0)$. This is enough to prove that
$(p \geq 1)$ is an invariant as soon as the initial marking is an odd number.
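
The difference between the two strategies on Parity can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the predicates written above (the function names are ours): without saturation, k minor iterations exclude the markings 0, 2, ..., 2k, whereas the saturated predicate excludes every positive even marking at once.

```python
def blocked_without_saturation(k):
    """Markings of p excluded by F_k after k minor iterations: 0, 2, ..., 2k."""
    return {2 * i for i in range(k + 1)}

def saturated(p):
    """Saturated predicate: exists k >= 0 such that p - 2*(k + 1) = 0."""
    # Searching k in 0..p is enough, since 2*(k + 1) > p beyond that range.
    return any(p - 2 * (k + 1) == 0 for k in range(p + 1))

# Markings that remain allowed once F = (p = 0) and the saturated predicate
# are both blocked: only odd markings survive.
allowed = [p for p in range(20) if p != 0 and not saturated(p)]
print(allowed)

# Without saturation there is always a larger even marking left to block.
for k in range(5):
    assert 2 * (k + 1) not in blocked_without_saturation(k)
```

This is why the unquantified frames keep growing on this net, while a single quantified clause closes the argument.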

This example proves that PDR is not complete, without saturation, in the 
general case. We conjecture that it is also the case with saturation. Even though 
example Parity is extremely simple, it is also enough to demonstrate the limit 
of our method without saturation. Indeed, when we only allow unquantified 
linear predicates with variables in P, it is not possible to express all the possible 
semilinear sets in $\mathbb{N}^P$. (We typically miss some periodic sets.) In practice, it is not
always useful to saturate a trace and, in our implementation, we use heuristics 
to limit the number of quantifications introduced by this operation. Actually, 
nothing prevents us from mixing our different kinds of generalization together, 
and there is still much work to be done in order to find good tactics in this case. 


4 Experimental Results 


We have implemented our complete approach in a tool, called SMPT (for
Satisfiability Modulo P/T Nets), and made our code freely available under the
GPLv3 license. The software, scripts and data used to perform our analyses are
available on GitHub (https://github.com/nicolasAmat/SMPT) and are archived in
Zenodo [2]. The tool supports the declaration of reachability constraints
expressed using the same syntax as in the Reachability examinations of the
Model Checking Contest (MCC). For instance, we use PNML as the input format for
nets. SMPT relies on an SMT solver to answer sat and unsat-core queries. It
interacts with SMT solvers using the SMT-LIBv2 format, which is a
well-supported interchange format. We used the Z3 solver for all the results
presented in this section.

Instance       SMPT     ITS-Tools   LoLA    TAPAAL
Murphy          0.75*   TLE         TLE     TLE
PGCD            0.11*   139.08      TLE     TLE
CryptoMiner     0.19*   5.92        TLE     0.18
Parity          0.40*   3.36        0.01    4.16
Process        83.39    TLE         0.03    0.18

Table 1. Computation time on our synthetic examples (time in seconds).


Evaluation on Expressiveness. It is difficult to find benchmarks with un- 
bounded Petri nets. To quote Blondin et al. [7], “due to the lack of tools handling 
reachability for unbounded state spaces, benchmarks arising in the literature are 
primarily coverability instances”. It is also very difficult to randomly generate a 
true invariant that does not follow, in an obvious way, from the state equation. 
For this reason, we decided to propose our own benchmark, made of five syn- 
thetic examples of nets, each with a given invariant. This benchmark is freely 
available and presented as an archive similar to instances of problems used in 
the MCC. 

Our benchmark is made of deceptively simple nets that have been engineered
to be difficult or impossible to check with current techniques. Our first two
examples are displayed in Fig. 1. We give another example in Fig. 2. Each
example is quite small, with fewer than 10 places or transitions, and is
representative of patterns that can make the reachability problem harder: the
use of self-loops; dead transitions that cannot be detected with the state
equation; weights that are relatively prime; etc.

We compared SMPT against ITS-Tools, LoLA, and TAPAAL and give our results in
Table 1. All results are computed using 4 cores, a limit of 16GB of RAM, and a
timeout of 1h. A result of TLE stands for “Time Limit Exceeded”. For SMPT, we
marked with an asterisk (*) the results computed using our saturation-based
generalization. Our results show that SMPT is able to answer on several classes
of examples that are out of reach for some, or all, of the other tools, often
by orders of magnitude.


Computing Certificate of Invariance. A distinctive feature of SMPT is the
ability to output a linear inductive invariant for reachability problems: when
we find that P is invariant, we are also able to output an inductive formula C,
of the form $P \land G$, that can be checked independently with an SMT solver.
We can find the same capability in the tool PETRINIZER [16] in the case of
coverability properties.

Fig. 2. Example Murphy, with invariant $P = (p_1 < 2 \land p_4 > p_5)$.

To get a better sense of this feature, we give the actual outputs computed with
SMPT on the two nets of Fig. 1. The invariant for the net Parity is
$P_1 = (p_0 \geq 1)$, and for PGCD it is $P_2 = (p_1 \leq p_2)$.

The certificate for property $P_1$ on Parity is
$C_1 = (p_0 \geq 1) \land \forall k . ((p_0 < 2k + 2) \lor (p_0 \geq 2k + 3))$,
which is equivalent to $(p_0 \geq 1) \land (\forall k \geq 1) . (p_0 \neq 2 \cdot k)$,
meaning the marking of $p_0$ is odd. This invariant would be different if we
changed the initial marking to an even number.


[PDR] Certificate of invariance
# (not (p0 < 1))
# (forall (k1) ((p0 < (2 + (k1 * 2))) or (p0 + (-2 * (k1 + 1))) >= 1))
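
The claimed equivalence between the Parity certificate and "the marking of p0 is odd" can be checked numerically on a finite range. The sketch below evaluates the quantified clause with k restricted to a bounded interval, which is sufficient for the markings tested since the first disjunct holds for every larger k; the function names and bounds are assumptions made for this example.

```python
def certificate(p0, k_max):
    """C_1 with the universal quantifier over k restricted to 0..k_max."""
    return p0 >= 1 and all(
        p0 < 2 * k + 2 or p0 >= 2 * k + 3 for k in range(k_max + 1)
    )

def is_odd(p0):
    return p0 % 2 == 1

# For p0 < 100, every k > 50 satisfies p0 < 2k + 2, so k_max = 100 is enough.
agree = all(certificate(p0, 100) == is_odd(p0) for p0 in range(100))
print(agree)
```

This does not replace the independent check with an SMT solver mentioned above; it only confirms the reading of the formula on small values.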


The certificate for property $P_2$ on PGCD is
$C_2 = (p_1 \leq p_2) \land \forall k . ((p_0 < k + 3) \lor (p_2 - p_1 \geq k + 1))$
and may seem quite inscrutable. It happens actually that the saturation
“learned” the invariant $p_0 + p_1 = p_2 + 2$ and was able to use this
information to strengthen property $P_2$ into an inductive invariant.


[PDR] Certificate of invariance
# (not (p1 > p2))
# (forall (k1) ((p0 < (3 + (k1 * 1))) or ((p1 + (1 * (k1 + 1))) <= p2)))


Evaluation on Performance. Since it is not sufficient to use only a small 
number of hand-picked examples to check the performance of a tool, we also 
provide results obtained on a set of 30 problems (a net together with an invariant) 
that are borrowed from test cases used by the tool SARA [32,33] and a similar 
software, called REACH, that is part of the TINA toolbox [5]. Most of these 
problems can be easily answered, but are interesting to test our reliability on a 
relatively even-handed benchmark. 

The experiments were performed with the same conditions as previously. We
display our results in the chart of Fig. 3, which gives the number of feasible
problems, for each tool, when we change the timeout value. We observe that our
performance is on par with TAPAAL, which is the fastest among our three
reference tools on this benchmark.

Fig. 3. Minimal timeout to compute a given number of queries.

Our tool is actually quite mature. In particular, a preliminary version of
SMPT [1] (without many of the improvements described in this work) participated
in the 2021 edition of the MCC, where we ranked fourth out of five competitors
and achieved a reliability in excess of 99.9%. Even though they were obtained
with a previous version of our tool, there are still lessons to be learned from
these results. In particular, they inform us on the behavior of SMPT on a very
large and diverse benchmark of bounded nets, with a majority of reachability
properties that are not invariants.

We can compare our results with those of LOLA, which fared consistently well
in the reachability category of the MCC. LOLA is geared towards model checking
of finite state spaces, but it also implements semi-decision procedures for the
unbounded case. Out of 45 152 reachability queries at the MCC in 2021 (one
instance of a net with one formula), LOLA was able to solve 85% of them
(38 175 instances) and SMPT only 52% (23 375 instances); this means that LOLA
solved approximately 1.6 times as many instances as SMPT. Most of the instances
solved with SMPT have also been solved by LOLA, but 1631 instances are still
computed only with our tool, meaning we potentially increase the number of
computed queries by 4%. This is quite an honorable result for SMPT, especially
when we consider the fact that we use a single technique, with only a limited
number of optimizations.
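
For reference, the percentages quoted in this comparison follow directly from the raw counts. The short computation below only restates the figures given in the text; the variable names are ours.

```python
total, lola, smpt, only_smpt = 45152, 38175, 23375, 1631

lola_rate = lola / total     # share of queries solved by LOLA
smpt_rate = smpt / total     # share of queries solved by SMPT
ratio = lola / smpt          # how many times more instances LOLA solves
gain = only_smpt / lola      # relative increase brought by SMPT-only answers

print(f"{lola_rate:.1%} {smpt_rate:.1%} {ratio:.2f} {gain:.1%}")
```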


5 Conclusion and Related Works 


One of the most important results in concurrency theory is the decidability of
reachability for Petri nets or, equivalently, for Vector Addition Systems with
States (VASS) [25]. Even if this result is based on a constructive proof, and
its “construction” was streamlined over time [26], the classical
Kosaraju-Lambert-Mayr-Sacerdote-Tenney approach does not lead to a workable
algorithm. It is in fact a feat that this algorithm has been implemented at
all, see e.g. the tool KREACH [15]. While the (very high) complexity of the
problem means that no single algorithm could work efficiently on all inputs, it
does not prevent the existence of methods that work well on some classes of
problems. For example, several algorithms are tailored for the discovery of
counter-examples. We mention the tool FASTFORWARD [7] in our experiments, which
explicitly targets the case of unbounded nets.


We propose a method that works as well on bounded as on unbounded nets;
that behaves well when the invariant is true; and that works with “genuine”
reachability properties, and not only with coverability. But there is of course no
panacea. Our approach relies on the use of linear predicates, which are incremen- 
tally strengthened until we find an invariant based on: the transition relation of 
the net; the property we want to prove (it is “property-directed”); and constraints 
on the initial states. This is in line with a property proved by Leroux [26], which 
states that when a final configuration is not reachable then “there exist check- 
able certificates of non-reachability in the Presburger arithmetic.” Our extension 
of PDR provides a constructive method for computing such certificates, when 
it terminates. For our future works, we would like to study more precisely the 
completeness of our approach and/or its limits. 


This is not something new. There are many tools that rely on the use of in- 
teger programming techniques to check reachability properties. We can mention 
the tool SARA [33], that is now integrated inside LOLA and can answer reach- 
ability problems on unbounded nets; or libraries like Fast [4], designed for the 
analysis of systems manipulating unbounded integer variables. An advantage of 
our method is that we proceed in a lazy way. We never explicitly compute the 
structural invariants of a net, never switch between a Presburger formula and 
its representation as a semilinear set (useful when one wants to compute the
“Kleene closure” of a linear constraint), and instead let an SMT solver work
its magic.


We can also mention previous works on adapting PDR/IC3 to Petri nets. 
A first implementation of SMPT was presented in [1], where we focused on the 
integration of structural reductions with PDR. This work did not use our abstrac- 
tion methods based on hurdles and saturation, which are new. We can find other 
related works, such as [16,23,24]. Nonetheless they all focus on coverability prop- 
erties. Coverability is not only a subclass of the general reachability problem, it 
has a far simpler theoretical complexity (EXPSPACE vs NONELEMENTARY). 
It is also not expressive enough for checking the absence of deadlocks or for 
complex invariants, for instance involving a comparison between the marking 
of two places, such as p < q. The idea we advocate is that approaches based 
on the generalization of markings are not enough. This is why we believe that 
abstractions (G2) and (G3) defined in Lemma 2 are noteworthy. 




We can also compare our approach with tools oriented to the verification of
bounded Petri nets, since many of them integrate methods and semi-decision
procedures that can work in the unbounded case. The best performing tools in
this category are based on a portfolio approach and mix different methods. We
compared ourselves with three tools: ITS-Tools [30], TAPAAL [14] and LOLA [34],
which form the top trio in the Model Checking Contest [3] (and can therefore
accept a common syntax to describe nets and properties). Our main contribution
in this context, and one of our most complex results, is to provide a new
benchmark of nets and properties that can be used to evaluate future
reachability algorithms “for expressiveness”.

The methods closest to ours in these portfolios are Bounded Model Checking
and k-induction [28], which are also based on the use of SMT solvers. We can
mention the case of ITS-Tools [31], which can build a symbolic
over-approximation of the state space, represented as a set of constraints.
This approximation is enough when it is included in the invariant that we
check, but inconclusive otherwise. A subtle and important difference between
PDR and these methods is that PDR needs only 2n variables (the p and p’),
whereas we need n fresh variables at each new iteration of k-induction (so kn
variables in total). This contributes to the good performance of PDR since the
complexity of the SMT problems is in part relative to the number of variables
involved. Another example of over-approximation is the use of the so-called
“state equation method” [27], which can strengthen the computations of
inductive invariants by adding extra constraints, such as place invariants
[29], siphons and traps [16,17], causality constraints, etc. We plan to exploit
similar constraints in SMPT to better refine our invariants.

To conclude, our experiments confirm what we already knew: we always benefit
from using a more diverse set of techniques, and are still in need of new
techniques, able to handle new classes of problems. For instance, we can
attribute the good results of TAPAAL, in our experiments, to its implementation
of a Trace Abstraction Refinement (TAR) technique, guided by counter-examples
[10]. The same can be said of LOLA, which also uses a CEGAR-like method [33].
We believe that our approach could be a useful addition to these techniques.


Acknowledgements. We would like to thank Alex Dixon, Philip Offtermatt 
and Yann Thierry-Mieg for their support when evaluating their respective tools. 
Their assistance was essential in improving the quality of our experiments.


Abstract. While model checking safety of infinite-state systems by
inferring state invariants has steadily improved recently, most verification
tools still rely on a technique based on bounded model checking to detect 
safety violations. In particular, the current techniques typically analyze 
executions by unfolding transitions one step at a time, and the slow 
growth of execution length prevents detection of deep counterexamples 
before the tool reaches its limits on computations. We propose a novel 
model-checking algorithm that is capable of both proving unbounded 
safety and finding long counterexamples. The idea is to use Craig inter- 
polation to guide the creation of symbolic abstractions of exponentially 
longer sequences of transitions. Our experimental analysis shows that on 
unsafe benchmarks with deep counterexamples our implementation can 
detect faulty executions that are at least an order of magnitude longer 
than those detectable by the state-of-the-art tools. 


Keywords: Model checking - Transition systems - Craig interpolation - 
Model-based projection. 


1 Introduction 


Model checking [17] is a very successful technique widely used for formal
verification of hardware and software. While its ultimate goal is to prove
safety, the ability to discover and report counterexamples is what primarily
contributes to its industrial success. The algorithm that paved the way for its
adoption in industry, bounded model checking (BMC) [9], remains one of the most
successful techniques for detecting counterexamples. A typical BMC algorithm
searches for counterexamples reachable in a finite number of steps, and if
nothing is found, it increases the search limit and restarts. This philosophy
has been largely adopted by most modern model-checking algorithms based on
reachability analysis, as one of its advantages is that it finds the shortest
counterexample (if one exists). However, it also results in scalability issues.
Specifically, in modern software systems, it is not uncommon that a program
must iterate through a certain loop thousands of times (or more) before it
reaches some error state. These deep counterexamples pose problems for
reachability-based algorithms that unroll the system’s transition relation one
transition at a time.


An important class of loops present in software systems are multi-phase 
loops [44]. A multi-phase loop, in short, is a loop with a conditional (branch) in 
its body such that the conditional exhibits a fixed number of phase transitions 
during the execution of the loop. A phase is a sequence of iterations during which 
the conditional has the same value. Multi-phase loops are notoriously challenging 
to analyze. When they are safe, they typically require disjunctive invariants. On 
the other hand, an unsafe multi-phase loop may admit only deep counterexamples 
if only later phases reveal the unsafe behavior. 


In this paper we present a novel model-checking algorithm that is able to 
find counterexamples of much greater depth than state-of-the-art algorithms. 
At the same time, it is able to prove a system safe under certain conditions
and is also competitive on a general set of benchmarks. We build upon the large
body of work on SMT-based model checking [1,3,4,8,14,15,25,28,30,37,38] and 
use Craig interpolation [18,35] for computing abstractions. However, we shift the 
focus from state abstractions—which is the widespread approach—to transition 
abstractions [40]. 


Our algorithm works on transition systems and it builds a sequence of ab- 
stract relations that gradually summarize (in an over-approximating way) an 
increasing number of steps of the transition relation. One important feature 
is that the summarized number of steps increases exponentially, not linearly. 
Another important feature is that all the abstract relations are expressed only 
over state and next-state variables, i.e., they do not require multiple copies of 
state variables to capture multiple steps of the transition relation. This sequence 
of abstract relations is used to refute the existence of bounded reachability paths 
in the system. If existence of a path cannot be refuted in the current abstraction, 
either the abstraction is strengthened to refute such path, or the path is shown to 
be real. The precise mechanics of building and refining the sequence of abstract 
relations are explained in Section 4. Our experiments demonstrate that our 
algorithm improves the ability to detect deep counterexamples in the multi-phase 
loop programs up to two orders of magnitude compared to the state-of-the-art. 
Furthermore, it enables the detection of bugs left undiscovered by the other tools. 


The main contributions of the paper are the following: 


- A novel model-checking algorithm for safety properties of transition systems,
  based on a sequence of relations over-approximating an exponentially
  increasing number of steps of the transition relation.

- Proof of correctness of the algorithm and its termination for unsafe systems.

- Implementation and experimental evaluation of the proposed algorithm,
  demonstrating its capability to find deep counterexamples in challenging
  benchmarks containing multi-phase loops.


The rest of the paper is organized as follows. The necessary background is 
given in Section 2, and a motivating example is given in Section 3. Section 4 
describes our novel algorithm, and Section 5 presents the experimental results. 
We discuss the related work in Section 6 and conclude in Section 7. 


2 Background 


Safety problem We work with a standard symbolic representation of transition
systems using the language of first-order logic. Given a set of variables X, we
denote as X’ the primed copy of X, i.e., $X' = \{x' \mid x \in X\}$. X is a set
of state variables and X’ is a set of next-state variables. The formulas are
interpreted with respect to some background theory $\mathcal{T}$; in our
examples and benchmarks we work with the theory of linear real or integer
arithmetic (LRA and LIA in the terminology of satisfiability modulo theories
(SMT) [6,7]). We say that a formula in the language of $\mathcal{T}$ over X is
a state formula and a formula over $X \cup X'$ is a transition formula. We
identify state formulas with a set of states where they hold and we freely move
between these two representations. Similarly, we identify transition formulas
with binary relations over the set of states. The identity relation Id(x, x’)
corresponds to the transition formula x = x’.

A transition system is a pair (Init, Tr) where Init is a state formula
representing the initial states of the system and Tr is a transition formula
representing the transition relation of the system. A safety problem is a
triple (Init, Tr, Bad) where (Init, Tr) is a transition system and Bad is a
state formula representing bad states.

When we only need to distinguish state and next-state variables, but not the 
individual state variables, for simplicity we only use the lower-case x, x’ and not 
X, X’. These can be viewed as variables representing tuples. We also often need 
to refer to next-next-state variables, which we denote as x”. 

We use $\circ$ to represent concatenation of relations. For example, given two
relations $R_1(x, y)$ and $R_2(y, z)$, $R = R_1 \circ R_2$ is a relation over
x, z such that $R(x, z) \iff \exists y : R_1(x, y) \land R_2(y, z)$. In
transition systems we can define relations that represent multiple steps of a
transition relation. For example, $Tr^2(x, x'') = Tr(x, x') \circ Tr(x', x'')$
relates pairs of states (s, t) such that t is reachable from s in exactly two
steps of the transition relation Tr. We also write that $(s, t) \in Tr^2$.
Existence of a counterexample (a path from some initial to some bad state) of a
fixed length l can be encoded as a satisfiability check of the formula

$$Init(x^{(0)}) \land Tr(x^{(0)}, x^{(1)}) \land Tr(x^{(1)}, x^{(2)}) \land \dots \land Tr(x^{(l-1)}, x^{(l)}) \land Bad(x^{(l)}),$$

where $x^{(i)}$ is a state variable shifted i steps, “with i primes”. A
satisfying assignment determines l + 1 states such that the first one is an
initial state, the last one is a bad state, and each successor can be reached
from its predecessor by one step of the transition relation Tr. If there is no
satisfying assignment then no path of l steps from Init to Bad exists.


Craig interpolation [18] Given an unsatisfiable formula $A \land B$, an
interpolant I is a formula over the shared symbols of A and B such that
$A \implies I$ and $I \land B$ is unsatisfiable. We denote as Itp(A, B) an
interpolation procedure that computes an interpolant for unsatisfiable
$A \land B$. Various interpolation procedures exist, for propositional logic
[31,42,34,19] as well as for different first-order theories [36,16,2,11].
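
The two conditions defining an interpolant can be illustrated on a small finite domain. The sketch below assumes that A ranges over variables (x, y), B over (y, z), and the candidate interpolant over the shared variable y only; the helper `is_interpolant` and the example formulas are hypothetical and use brute-force enumeration in place of a real interpolation procedure.

```python
from itertools import product

def is_interpolant(a, b, itp, domain):
    """Check A(x, y) => I(y) and that I(y) and B(y, z) is unsatisfiable."""
    implied = all(itp(y) for x, y in product(domain, repeat=2) if a(x, y))
    disjoint = not any(itp(y) and b(y, z) for y, z in product(domain, repeat=2))
    return implied and disjoint

domain = range(-5, 6)
A = lambda x, y: x >= 0 and y == x + 1     # forces y >= 1
B = lambda y, z: z == y and z <= 0         # forces y <= 0

print(is_interpolant(A, B, lambda y: y >= 1, domain))   # valid candidate
print(is_interpolant(A, B, lambda y: True, domain))     # too weak
print(is_interpolant(A, B, lambda y: y >= 2, domain))   # not implied by A
```

The candidate y ≥ 1 mentions only the shared variable, follows from A, and contradicts B, which is exactly what the definition requires.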


3 Motivating example 


Throughout the paper we demonstrate our approach on a family of C-like
programs with a multi-phase loop (generalized from [44] where N=50) and an
unsafe assertion. The parameter N (not to be confused with a nondeterministic
variable) lets us scale the length of the counterexamples that have to be
found. We have experimentally evaluated how various tools perform on this
example in Section 5. The program source code and the corresponding transition
system are given in Figure 1.


    x = 0; y = N;
    while (x < 2N) {
        x = x + 1;
        if (x > N)
            y = y + 1;
    }
    assert(y != 2N);

    Init(x, y) ≡ x = 0 ∧ y = N
    Tr(x, y, x′, y′) ≡ x < 2N ∧ x′ = x + 1 ∧ y′ = ite(x′ > N, y + 1, y)
    Bad(x, y) ≡ x ≥ 2N ∧ y = 2N

Fig. 1: An example of unsafe multi-phase loop


Since the assertion is placed after the loop, any counterexample requires
finding a complete unrolling of the loop, i.e., all 2N iterations (or 2N steps
in the corresponding transition system). Interestingly, even a linear growth of
N results in an exponential growth of the complexity of the search for
counterexamples. Because of the control-flow divergence in each iteration of
the loop, the number of possible program paths (that a verifier explores)
doubles with each increment of the counter x. Our technique finds the
counterexamples for any N far more efficiently.

input  : transition system S = (Init, Tr, Bad)
global : TPA sequence S (lazily initialized to true)
Function CheckSafetyTPA((Init, Tr, Bad)):
    S[0] ← Id ∨ Tr
    if Sat?[Init(x) ∧ S[0](x, x′) ∧ Bad(x′)] then return UNSAFE
    n ← 0
    while true do
        res ← IsReachable(n, Init, Bad)
        if res ≠ ∅ then return UNSAFE
        n ← n + 1
    end

Algorithm 1: Main procedure for checking safety
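
To see concretely why the counterexample of the example in Section 3 has depth 2N, the transition system of Fig. 1 can be executed step by step. Since this particular system is deterministic, a plain simulation is enough; the function names below are ours and the code is only an explicit-state illustration, not the symbolic procedure of this paper.

```python
def init(n):
    """Initial state: x = 0 and y = N."""
    return (0, n)

def step(state, n):
    """One application of Tr; returns None when the guard x < 2N fails."""
    x, y = state
    if not x < 2 * n:
        return None
    x1 = x + 1
    return (x1, y + 1 if x1 > n else y)

def bad(state, n):
    """Bad states: x >= 2N and y = 2N."""
    x, y = state
    return x >= 2 * n and y == 2 * n

def counterexample_depth(n):
    """Number of Tr steps from Init to the first Bad state, or None."""
    state, depth = init(n), 0
    while state is not None:
        if bad(state, n):
            return depth
        state, depth = step(state, n), depth + 1
    return None

for n in (1, 5, 50):
    print(n, counterexample_depth(n))
```

An algorithm that unrolls Tr one step at a time therefore needs 2N unrollings before the bad state becomes visible.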


4 Finding deep counterexamples with transition power 
abstractions 


Our main procedure for detecting safety violation—given in Algorithm 1—follows 
the typical scheme of bounded model checking where in each iteration the 
reachability of Bad is checked within certain bounded number of steps and 
the bound gradually increases. This scheme has also been adopted by other 
model checking algorithms, such as Spacer [30] and interpolation-based model 
checking [20,34,45], which further support a generalization/adaptation of the 
proof of bounded safety to a proof of unbounded safety. 

The distinguishing feature of our approach is that it increases the bound for
the safety check exponentially in the number of iterations, while other
approaches do this linearly. That is, in the n-th iteration, traditional
algorithms check bounded safety up to n steps, but our approach checks up to
$2^{n+1}$ steps. However, we do not unroll the transition relation an
exponential number of times. Instead, we maintain a sequence of transition
formulas (i.e., each formula contains only two copies of the state variables)
where each element over-approximates twice as many steps of transition relation
Tr as its predecessor. We call this sequence a Transition Power Abstraction
(TPA) sequence.


4.1 TPA sequence for bounded reachability queries 


The core of our approach lies in creating and refining a sequence of relations
$ATr^{\leq 0}, ATr^{\leq 1}, \dots, ATr^{\leq n}, \dots$ where each relation
over-approximates twice as many transition steps of a transition relation Tr as
its predecessor. Formally, we require that the n-th relation $ATr^{\leq n}$
satisfies:

$$Id(x, x') \lor Tr(x, x') \lor Tr^2(x, x') \lor \dots \lor Tr^{2^n}(x, x') \implies ATr^{\leq n}(x, x') \qquad (1)$$

The base for constructing a TPA sequence is $ATr^{\leq 0} = Id \lor Tr$. Thus,
$ATr^{\leq 0}$ is not an over-approximation, but a precise relation capturing
true reachability in either 0 or 1 steps.

Our check for bounded safety is based on a procedure that answers bounded
reachability queries: Given a set of source and target states, is any target
state reachable from some source state in up to $2^{n+1}$ steps (for
$n \geq 0$)? The procedure uses the TPA sequence to answer such queries and, at
the same time, it extends the sequence and refines its existing elements.
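
The doubling behind condition (1) can be observed on an explicit finite system. The sketch below builds the exact (non-abstract) sequence $R_0 = Id \cup Tr$, $R_{n+1} = R_n \circ R_n$ and compares each level with plain step-by-step reachability. It is only meant to illustrate why n compositions cover $2^n$ steps; relations are represented as Python sets of pairs, and all names are assumptions of this example.

```python
def compose(r1, r2):
    """Relational composition: (a, c) whenever (a, b) in r1 and (b, c) in r2."""
    succ = {}
    for b, c in r2:
        succ.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in succ.get(b, ())}

def exact_power_sequence(states, tr, levels):
    """Return [R_0, ..., R_levels] with R_0 = Id | Tr and R_{n+1} = R_n o R_n."""
    seq = [{(s, s) for s in states} | set(tr)]
    for _ in range(levels):
        seq.append(compose(seq[-1], seq[-1]))
    return seq

def reachable_within(states, tr, steps):
    """Pairs (s, t) with t reachable from s in at most `steps` steps of tr."""
    frontier = {(s, s) for s in states}
    result = set(frontier)
    for _ in range(steps):
        frontier = compose(frontier, set(tr))
        result |= frontier
    return result

# A simple counter 0 -> 1 -> ... -> 19 as transition relation.
states = range(20)
tr = {(i, i + 1) for i in range(19)}
seq = exact_power_sequence(states, tr, 3)
print([r == reachable_within(states, tr, 2 ** n) for n, r in enumerate(seq)])
```

Any relation that contains the exact level n is an admissible n-th element of a TPA sequence; the exact sequence is simply its strongest instance.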

Given two sets of states, Source and Target, and nt} element of the current 
TPA sequence ATr=", the following SMT query is issued: 


Sat?[Source(x) \ ATrS"(a, x) A ATrS"(a', x") A Target (2x")). (2) 


If query (2) is unsatisfiable, it means that there is no intermediate state 
that would be reachable from Source using one step of ATr<” and, at the same 
time, can reach Target in yet another step of ATr<". Since one step of ATrS<” 
over-approximates reachability (using Tr) in 0 to 2” steps, this means that no 
path of length <2”*! exists from Source to Target. Thus, the procedure can 
immediately conclude that no state from Target is reachable from any state in 
Source in <2"! steps. 

Additionally, it is also possible to learn new information about the reachability
in $\le 2^{n+1}$ steps in the form of an interpolant between $ATr^{\le n}(x, x') \land ATr^{\le n}(x', x'')$
and $Source(x) \land Target(x'')$. The properties of interpolation guarantee that the
interpolant contains only variables $x, x''$ (i.e., it does not contain $x'$), it
over-approximates $ATr^{\le n} \circ ATr^{\le n}$, and it does not relate any source state with a
target state. The relation defined by such an interpolant satisfies condition (1)
for the $(n+1)$-st element of the TPA sequence, and the current TPA sequence can be
refined by conjoining the interpolant (after renaming of variables) to its $(n+1)$-st
element.
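
The interpolant properties listed above have a direct explicit-state reading. The sketch below assumes relations as sets of pairs, so the restriction to the variables $x, x''$ is automatic; `is_interpolant` and the toy data are hypothetical names chosen for this illustration, and a real implementation would obtain the interpolant from an interpolating SMT solver instead.

```python
def compose(r1, r2):
    """Relational composition: pairs (a, c) with a -r1-> b -r2-> c for some b."""
    successors = {}
    for b, c in r2:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in successors.get(b, ())}


def is_interpolant(candidate, atr, source, target):
    """Check both properties: over-approximates ATr o ATr, separates Source from Target."""
    over_approximates = compose(atr, atr) <= candidate
    separates = not any(s in source and t in target for s, t in candidate)
    return over_approximates and separates


# Hypothetical toy data: counter 0..5, ATr = "0 or 1 step", query (2) is unsatisfiable.
states = range(6)
atr = {(i, i) for i in states} | {(i, i + 1) for i in range(5)}
source, target = {0}, {5}

strongest = compose(atr, atr)                       # exactly two ATr steps
weakest = {(a, b) for a in states for b in states
           if not (a in source and b in target)}    # everything except Source x Target
arithmetic = {(a, b) for a in states for b in states if b <= a + 2}  # "x'' <= x + 2"
too_strong = {(a, b) for a in states for b in states if b <= a + 1}  # loses (0, 2)

# Refinement of the next element, which starts as "true" (all pairs).
next_element = {(a, b) for a in states for b in states}
next_element &= arithmetic
```

Many relations between the strongest and the weakest choice qualify; which one the solver returns determines how useful the refined element is for later queries.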

If query (2) is satisfiable, there exists some intermediate state m that can
be reached from Source by one step of $ATr^{\le n}$ and that can reach Target by
yet another step of $ATr^{\le n}$. If $n = 0$, the procedure returns and reports the
answer “reachable” as $ATr^{\le 0}$ is precise, not over-approximating. Otherwise, such
an intermediate state m can be seen as a potential point on the path from Source
to Target, and this path can be shown to be real if there exist two real paths:
from Source to m and from m to Target. The existence of these two real paths
can be checked in a recursive manner.


4.2 Algorithm for bounded reachability checks 


The pseudocode for the procedure is given in Algorithm 2. We first explain the 
steps in more detail and demonstrate a run of the algorithm on our example 
from Section 3. We then prove the correctness and termination of Algorithm 2 
from which follow the correctness of Algorithm 1 and its termination for unsafe 
systems. 

Function IsReachable takes as input an integer $n \ge 0$, a set of source states,
and a set of target states. The output is a subset of target states that are reachable
in $\le 2^{n+1}$ steps of transition relation Tr. The output set is empty if and only if
no target state is reachable from any source state within the given bound.

input  : level n, source states Source, target states Target
output : subset of Target reachable from Source within $2^{n+1}$ steps
global : TPA sequence S
Function IsReachable(n, Source, Target):
 1  while true do
 2      $ATr^{\le n} \leftarrow S[n]$
 3      $query \leftarrow Source(x) \land ATr^{\le n}(x, x') \land ATr^{\le n}(x', x'') \land Target(x'')$
 4      $sat\_res \leftarrow Sat?[query]$
 5      if sat_res = UNSAT then
 6          $I \leftarrow Itp(ATr^{\le n}(x, x') \land ATr^{\le n}(x', x''),\ Source(x) \land Target(x''))$
 7          $S[n+1] \leftarrow S[n+1] \land I[x'' \mapsto x']$
 8          return $\emptyset$
 9      else
10          if n = 0 then return $QE(\exists x, x'.\ query)[x'' \mapsto x]$
11          $Intermediate \leftarrow QE(\exists x, x''.\ query)[x' \mapsto x]$
12          $IntermediateReached \leftarrow$ IsReachable(n − 1, Source, Intermediate)
13          if $IntermediateReached = \emptyset$ then continue
14          $TargetReached \leftarrow$ IsReachable(n − 1, IntermediateReached, Target)
15          if $TargetReached = \emptyset$ then continue
16          return TargetReached
17      end
18  end


Algorithm 2: Reachability query using TPA 


The procedure loops until it computes a truly reachable subset of target states
or proves all target states unreachable. In each iteration the procedure reads
the current $n$-th element of the TPA sequence (line 2). Note that this element may
differ between iterations, as the TPA sequence is updated in the recursive
calls on lines 12 and 14. After that, a satisfiability query is constructed and
passed to a decision procedure for the background theory T (lines 3 and 4).
The satisfiability query asks whether there exists an
intermediate state that is reachable from Source using one step of $ATr^{\le n}$
and, at the same time, can reach Target in yet another step of $ATr^{\le n}$.


Query on line 4 is unsatisfiable. If the query is unsatisfiable then no target state
can be reached from any source state in two steps of $ATr^{\le n}$. It follows from
Eq. (1) that no target state can be reached from any source state in $\le 2^{n+1}$ steps.
Before indicating the unreachability by returning $\emptyset$ (line 8), the function updates
the TPA sequence to ensure termination (discussed later): The function computes
an interpolant between $ATr^{\le n}(x, x') \land ATr^{\le n}(x', x'')$ and $Source(x) \land Target(x'')$
(line 6). After renaming variables, the interpolant is conjoined to the $(n+1)$-st
element of the TPA sequence (line 7). The following example demonstrates this part of
the procedure on our motivating example.


Example 1. Consider the system from Figure 1 for N = 3. This system is not
safe and the counterexample requires six steps of transition relation Tr.

After Algorithm 1 initializes the base element of the TPA sequence to
$(x' = x \land y' = y) \lor (x < 6 \land x' = x + 1 \land y' = ite(x' > 3, y + 1, y))$, it issues a reachability
query IsReachable$(0,\ x = 0 \land y = 3,\ x \ge 6 \land y = 6)$ in the first iteration of its
loop. This translates to a satisfiability check of the formula


$$\begin{aligned}
        & x = 0 \land y = 3 \\
\land{} & ((x' = x \land y' = y) \lor (x < 6 \land x' = x + 1 \land y' = ite(x' > 3, y + 1, y))) \\
\land{} & ((x'' = x' \land y'' = y') \lor (x' < 6 \land x'' = x' + 1 \land y'' = ite(x'' > 3, y' + 1, y'))) \\
\land{} & x'' \ge 6 \land y'' = 6
\end{aligned}$$


on line 4 of Algorithm 2. This query is unsatisfiable, and $x'' \le x + 2$ is a possible
interpolant computed on line 6. After variable renaming, this interpolant refines
S[1], which becomes $x' \le x + 2$. Then this call to IsReachable terminates and
the main loop issues a new reachability query for n = 1. This yields a satisfiability
query $x = 0 \land y = 3 \land x' \le x + 2 \land x'' \le x' + 2 \land x'' \ge 6 \land y'' = 6$. Again, this
formula is unsatisfiable and a possible interpolant is $x'' \le x + 4$. The next element
of the TPA sequence, S[2], is refined to $x' \le x + 4$.

For n = 2 (reachability within eight steps), the query on line 4 is satisfiable, 
and the procedure switches to checking if the counterexample from abstract 
transition is real or exists only due to a coarse abstraction. 
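
The arithmetic facts used in Example 1 can be checked by brute force. The sketch below hard-codes the base element given in the example for N = 3; the finite grid of start states and all function names are assumptions made for this illustration only.

```python
def steps01(state):
    """Successors under the base element of Example 1: Id or one step of Tr (N = 3)."""
    x, y = state
    successors = {(x, y)}                                   # x' = x and y' = y
    if x < 6:                                               # x < 6 and x' = x + 1
        successors.add((x + 1, y + 1 if x + 1 > 3 else y))  # y' = ite(x' > 3, y + 1, y)
    return successors


def two_abstract_steps(state):
    """All x'' reachable by two consecutive steps of the base element."""
    return {s2 for s1 in steps01(state) for s2 in steps01(s1)}


def is_bad(state):
    """Bad states of the example: x >= 6 and y = 6."""
    return state[0] >= 6 and state[1] == 6


# Finite slice of start states (an assumption; the real state space is unbounded).
grid = [(x, y) for x in range(9) for y in range(9)]

# A => I: two steps of the base element never increase x by more than 2.
interpolant_is_implied = all(x2 <= x + 2
                             for (x, y) in grid
                             for (x2, _) in two_abstract_steps((x, y)))

# I and B are inconsistent: with x = 0, no x'' satisfies x'' <= x + 2 and x'' >= 6.
interpolant_excludes_b = not any(x2 <= 0 + 2 and x2 >= 6 for x2 in range(9))

# The query for n = 0 is unsatisfiable: no bad state within two steps of Init.
query_n0_unsat = not any(is_bad(s) for s in two_abstract_steps((0, 3)))
```

The same style of check applies to the second interpolant: composing $x' \le x + 2$ with itself can only yield $x'' \le x + 4$.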


Query on line 4 is satisfiable. If the query on line 4 is satisfiable, a concrete path
of length $\le 2^{n+1}$ cannot be ruled out at this point and the algorithm proceeds
to recursively check the existence of one. In the base case $n = 0$ of the recursion,
$ATr^{\le 0}$ is not an over-approximation but a precise relation representing 0 or 1 steps
of Tr and there exists a real path from Source to Target. The algorithm computes
a state formula representing a truly reachable subset of Target. This is done by first
using quantifier elimination (QE) to eliminate all except next-next state variables
from the query (line 10) and then renaming the variables to state variables.⁴

If the base case has not been reached yet ($n > 0$), the procedure first computes
a set of candidate intermediate states by eliminating all except next-state variables
from the query (line 11). Then, the procedure recursively calls itself to determine
the existence of a path from Source to the newly computed intermediate set with
the bound on length halved (line 12). This check has two possible outcomes. In
case the recursive call returns $\emptyset$, none of the intermediate candidates is reachable
(within $2^{n}$ steps). Moreover, S[n] must have been strengthened (line 7) before the
recursive call returned, so that it no longer relates any of the source states to the
intermediate candidates. The procedure then continues to the next iteration (line 13) where it
tries to find new intermediate candidates or prove that none remain. In case
the set returned on line 12 is non-empty, it represents a set of states reachable
from Source within $2^{n}$ steps of Tr. The procedure proceeds to check the existence


4 QE computes maximal reachable subsets. While this is convenient for proving
termination of Algorithm 2, in practice quantifier elimination is a very expensive operation.
Our implementation therefore also supports the use of model-based projection to
efficiently under-approximate quantifier elimination (see Section 4.4).
of a path from these states to the target states (line 14). The reasoning here is
the same as for the first recursive call: If Target is not reachable, the procedure
attempts to find new intermediate candidates in a new iteration. Otherwise, a real
path from Source to Target exists and the computed truly reachable states are
returned. The returned states are reachable within $2^{n+1}$ steps as both recursive
calls check reachability within $2^{n}$ steps.

We continue Example 1 to illustrate this phase of Algorithm 2. 


Example 2. Following Example 1, the algorithm is checking bounded reachability
between Init and Bad for n = 2, i.e., within 8 steps. The issued satisfiability
query is $x = 0 \land y = 3 \land x' \le x + 4 \land x'' \le x' + 4 \land x'' \ge 6 \land y'' = 6$. Eliminating
all except next-state variables yields $x' \le 4 \land x' \ge 2$. This results in the call
IsReachable$(1,\ x = 0 \land y = 3,\ x \le 4 \land x \ge 2)$. The satisfiability query issued
next is $x = 0 \land y = 3 \land x' \le x + 2 \land x'' \le x' + 2 \land x'' \le 4 \land x'' \ge 2$. This is
again satisfiable and yields $x' \le 2 \land x' \ge 0$ after quantifier elimination. Now
we reach level 0 with a call IsReachable$(0,\ x = 0 \land y = 3,\ x \le 2 \land x \ge 0)$. The
constructed satisfiability query is again satisfiable and since we are at level 0, the
procedure returns a set of states truly reachable from $x = 0 \land y = 3$ within 2 steps.
These can be characterized as $(x = 0 \lor x = 1 \lor x = 2) \land y = 3$. The reachable
states are reported to level 1, which issues a reachability query for the second part:
IsReachable$(0,\ (x = 0 \lor x = 1 \lor x = 2) \land y = 3,\ x \le 4 \land x \ge 0)$. This is also
successful and returns reachable states $(x = 0 \lor x = 1 \lor x = 2 \lor x = 3 \lor x = 4) \land y = 3$.
These are states reachable from Init within 4 steps and they are reported
to level 2. There, the second part of the counterexample is found in a similar way
and the procedure concludes that Bad is truly reachable from Init within 8 steps.
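
To make the control flow of Algorithm 2 tangible, here is a minimal explicit-state sketch that replays the outcome of Examples 1 and 2. It is an analogy under loudly stated assumptions, not the symbolic procedure itself: relations are finite sets of pairs over a finite slice of the state space, the SMT query becomes a set computation, QE becomes exact projection, and the interpolant is always taken to be the strongest one (the composition $ATr^{\le n} \circ ATr^{\le n}$). The class name `TPA` and the loop over levels at the end are hypothetical.

```python
from itertools import product


def compose(r1, r2):
    """Relational composition: pairs (a, c) with a -r1-> b -r2-> c for some b."""
    successors = {}
    for b, c in r2:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in successors.get(b, ())}


class TPA:
    """Explicit-state stand-in for the TPA sequence S and the function IsReachable."""

    def __init__(self, states, tr):
        self.top = set(product(set(states), repeat=2))            # the relation "true"
        self.seq = [{(s, s) for s in states} | set(tr)]           # S[0] = Id or Tr

    def level(self, k):
        while len(self.seq) <= k:                                 # unseen elements start as "true"
            self.seq.append(set(self.top))
        return self.seq[k]

    def is_reachable(self, n, source, target):
        while True:
            atr = self.level(n)                                   # line 2
            post = {m for s, m in atr if s in source}
            mids = {m for m, t in atr if m in post and t in target}
            if not mids:                                          # query is UNSAT (line 5)
                self.level(n + 1)
                self.seq[n + 1] &= compose(atr, atr)              # strongest interpolant (lines 6-7)
                return set()
            if n == 0:                                            # line 10: exact projection on x''
                return {t for m, t in atr if m in mids and t in target}
            reached_mid = self.is_reachable(n - 1, source, mids)  # line 12
            if not reached_mid:
                continue
            reached_target = self.is_reachable(n - 1, reached_mid, target)  # line 14
            if not reached_target:
                continue
            return reached_target


# Finite slice of the system of Example 1 (N = 3); successors outside the slice are dropped.
states = {(x, y) for x in range(7) for y in range(3, 8)}
tr = {((x, y), (x + 1, y + 1 if x + 1 > 3 else y)) for (x, y) in states if x < 6}
tr = {(s, t) for s, t in tr if t in states}
init = {(0, 3)}
bad = {(x, y) for (x, y) in states if x >= 6 and y == 6}

tpa = TPA(states, tr)
results = [tpa.is_reachable(n, init, bad) for n in range(3)]      # levels 0, 1, 2
```

With the strongest interpolant the sequence elements coincide with exact bounded reachability, so this sketch shows only the recursion scheme; the practical benefit of the approach comes from interpolants that generalize, such as $x'' \le x + 2$ in Example 1.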

The behaviour of the algorithm on these examples can be generalized for the
system of Figure 1 for larger values of N. The length of the counterexample is 2N;
let $l$ denote $\lfloor \log_2(2N) \rfloor$. The bounded safety will be quickly determined up
to $2^{l}$ steps with $l$ calls to IsReachable which all return $\emptyset$ in their first iteration.
On the next iteration, for $n = l$, IsReachable will find the real counterexample,
but it requires $O(2^{l})$ recursive calls to find the counterexample of length in the
interval $(2^{l}, 2^{l+1}]$.


4.3 Correctness and termination 


We first prove correctness and termination of Algorithm 2 which then entails 
correctness of Algorithm 1 and its termination for unsafe systems. We prove the 
correctness of procedure IsReachable separately for the unreachable and the 
reachable case. 


Lemma 1. If IsReachable(n, Source, Target) returns $\emptyset$, then no state from
Target can be reached from Source within $2^{n+1}$ steps.


Proof. The proof relies on the invariant that S is always a TPA sequence, i.e.,
its elements satisfy the property of Eq. (1). This is obviously true when S is
initialized in Algorithm 1. The only update of S happens in Algorithm 2 on line 7.
Consider an update on any level $k \le n$. From the properties of interpolation, we
know that $I(x, x'')$ (on line 6) over-approximates $ATr^{\le k}(x, x') \land ATr^{\le k}(x', x'')$,
which represents two steps of the relation $ATr^{\le k}$. Since $ATr^{\le k}$ over-approximates
$\le 2^{k}$ steps of Tr, it follows that $I(x, x'')$ over-approximates $\le 2^{k+1}$ steps of Tr.
Thus, conjoining it to $ATr^{\le k+1}$ preserves the condition of Eq. (1).

It follows from Eq. (1) that when the query on line 4 is unsatisfiable, there
exists no path of length $\le 2 \times 2^{n} = 2^{n+1}$ from any source state to any target
state.


Lemma 2. If IsReachable(n, Source, Target) returns a non-empty set Res,
then $Res \subseteq Target$ and every state in Res can be reached from some state in
Source in $\le 2^{n+1}$ steps.


Proof. The proof is by induction on n. 

Base case: For $n = 0$, $ATr^{\le 0}$ represents precise reachability in 0 or 1 step.
It follows that if the query on line 4 is satisfiable, some target states are truly
reachable from the set of source states in $\le 2$ steps. Moreover, the properties
of QE guarantee that $Res = QE(\exists x, x'.\ query)[x'' \mapsto x]$ is a subset of $Target(x)$
that is reachable from Source using $ATr^{\le 0} \circ ATr^{\le 0}$.

Inductive case: Suppose the claim holds for $n - 1$. If at level n the procedure
returned a non-empty set, it must have been the case that the first recursive
call (line 12) returned a non-empty set IntermediateReached of states truly
reachable from Source in $\le 2^{n}$ steps, by our induction hypothesis. Additionally,
the second recursive call (line 14) also returned a non-empty set TargetReached
that, according to our induction hypothesis, is a subset of Target truly reachable
from IntermediateReached in $\le 2^{n}$ steps. It follows that TargetReached is a subset
of Target truly reachable from Source in $\le 2^{n+1}$ steps.


The correctness of procedure IsReachable extends naturally to the
correctness of our main procedure.


Theorem 1 (Correctness). If Algorithm 1 returns UNSAFE, then the system 
S is unsafe, i.e., some bad state is reachable from some initial state. 


Proof. The satisfiability query on line 2 of Algorithm 1 checks reachability in
0 and 1 step. If this query is satisfiable, there exists a counterexample path of
length 0 or 1 from some initial state to a bad state.

Otherwise, the algorithm enters the loop where UNSAFE is returned only if IsReachable
returns a non-empty set of states for some n. From the correctness of IsReachable
it follows that the returned set is a subset of Bad that is reachable from Init in
$\le 2^{n+1}$ steps. Thus there exists a counterexample path in the system.


Next, we want to show that if there exists a counterexample path in the 
system, our procedure will eventually report it. This boils down to the question 
of termination of a single call to IsReachable. 


Lemma 3. Assume that the satisfiability check (line 4) terminates, i.e., that the
background theory T is decidable, and that T has procedures for interpolation and
quantifier elimination.⁵ Then a single call to IsReachable always terminates.


5 The linear arithmetic theories of our experiments satisfy these assumptions. 

Proof. The proof proceeds by induction on level n. The base case (n = 0) trivially
terminates after a single satisfiability query on line 4.

For the inductive case, consider the first iteration of the loop. If the query is
unsatisfiable, the procedure terminates. If it is satisfiable, quantifier elimination
yields a set of states
$Intermediate = \{m \mid \exists s \in Source, \exists t \in Target : (s, m) \in ATr^{\le n} \land (m, t) \in ATr^{\le n}\}$.
Now consider the first recursive call (line 12). By
induction, it terminates. If it returns $\emptyset$, then, by properties of the interpolation,
$ATr^{\le n}$ has been strengthened such that
$\forall s \in Source, \forall m \in Intermediate : (s, m) \notin ATr^{\le n}$
now holds. Consequently, in the second iteration the query
on line 4 must be unsatisfiable and the procedure terminates.

Now consider the situation where the recursive call on line 12 returned a
non-empty set IntermediateReached. The procedure continues to the second
recursive call (line 14), which also terminates, by induction. If the returned set
TargetReached is non-empty, the procedure terminates (line 16). If it is empty,
then no state reachable from Source in $\le 2^{n}$ steps of Tr can reach any state in
Target in another $\le 2^{n}$ steps. Moreover, $ATr^{\le n}$ has been strengthened so that
now it does not relate any state from IntermediateReached with a state in Target.
In the second iteration, the query on line 4 could still be satisfiable. However,
the extracted Intermediate (of the second iteration) cannot contain states that
are reachable from Source in $\le 2^{n}$ steps. Thus the first recursive call (line 12) in the
second iteration must return $\emptyset$ and this is followed by an unsatisfiable query
(line 4) in the third iteration and termination.


The immediate consequence of Lemma 3 is that our main procedure will find 
a counterexample if one exists. 


Theorem 2. If there exists a counterexample in the system, Algorithm 1
terminates with the UNSAFE result.


4.4 Under-approximating QE with model-based projection 


Model-based projection (MBP) [30] is a recent technique for under-approximating
quantifier elimination for existentially quantified formulas. In short, given an
existentially quantified formula $\exists x\, \varphi(x, y)$, MBP is a function that maps each
model of $\varphi$ to a quantifier-free formula that implies $\exists x\, \varphi(x, y)$ and is true in the
model. Moreover, it is required that the function has a finite image (it produces
only finitely many quantifier-free under-approximations) and the disjunction of
the image is equivalent to the quantified formula. Efficient MBP for linear real and
integer arithmetic was given in [30,10]. MBP has also been designed for algebraic
datatypes [10], arithmetic signature of bit-vectors [23] and arrays⁶ [29].

Quantifier elimination in Algorithm 2 can be replaced by MBP in a
straightforward way. On line 4, if the query is satisfiable, we obtain from the SMT solver
a model witnessing the satisfiability. Then, on lines 10 and 11 we replace QE with 
MBP using the obtained model. It is easy to check that the proof of Lemma 2 
remains valid with this change, and thus also the result of Theorem 1. In Section 5 
we experimentally demonstrate the practical advantage of MBP over QE. 
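
The three MBP requirements (implies the quantified formula, true in the model, finite image covering the quantified formula) can be illustrated on a finite domain. The sketch below is only an analogy: it represents $\varphi(x, y)$ as a finite set of pairs and projects by substituting the model's value for $x$, whereas MBP procedures for arithmetic work symbolically on formulas. The function names and the sample formula are assumptions made for this illustration.

```python
def qe_exists_x(phi):
    """Exact projection: all y such that phi(x, y) holds for some x."""
    return {y for _, y in phi}


def mbp_exists_x(phi, model):
    """Under-approximate projection guided by a model (x0, y0) of phi.

    Substituting x0 for x keeps exactly the y values that work with this x0.
    """
    x0, y0 = model
    if (x0, y0) not in phi:
        raise ValueError("model must satisfy phi")
    return {y for x, y in phi if x == x0}


# Hypothetical formula phi(x, y): 0 <= x <= 3 and x <= y <= 2x + 1.
phi = {(x, y) for x in range(4) for y in range(10) if x <= y <= 2 * x + 1}

projection = mbp_exists_x(phi, (1, 2))          # only the y values compatible with x = 1
image = {frozenset(mbp_exists_x(phi, m)) for m in phi}
union_of_image = set().union(*image)
```

Each call returns a possibly small under-approximation cheaply; repeated calls with different models recover the full projection only when needed, which is the trade-off exploited when QE is replaced by MBP.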


6 MBP for arrays does not satisfy the finite image condition.


4.5 Proving safety


Even though the main purpose of the TPA sequence is to help to quickly rule out 
bounded reachability queries, it can also be useful in another way. Specifically, 
an element of the TPA sequence may turn out to be a transition invariant with 
respect to transition relation Tr. 


Definition 1 (transition invariant). We say that $R(x, x')$ is a transition
invariant if $Tr^{*} \subseteq R$, i.e., $\forall x, x'.\ Tr^{*}(x, x') \implies R(x, x')$, where $Tr^{*}$ is the
reflexive transitive closure of Tr.


Note that our definition is slightly simpler than that of [40], as it only depends 
on the transition relation and not, for example, on the initial states of the system. 

If we find a transition invariant that does not relate any initial state with a 
bad state, we can immediately conclude that the system is safe. We show one
way to detect whether a member of the TPA sequence is a transition invariant
using an SMT query.


Lemma 4. Assume that for some n, $ATr^{\le n} \circ Tr \subseteq ATr^{\le n}$ or that $Tr \circ ATr^{\le n} \subseteq ATr^{\le n}$.
Then $ATr^{\le n}$ is a transition invariant.


Proof. We consider the case $ATr^{\le n} \circ Tr \subseteq ATr^{\le n}$ and show that $Tr^{*} \subseteq ATr^{\le n}$.
The other case is analogous. Take any two states $s, s'$ such that $s'$ is reachable
from $s$, i.e., $(s, s') \in Tr^{*}$. We show that $(s, s') \in ATr^{\le n}$ by induction on d, the
length of the path from $s$ to $s'$. If $d \le 2^{n}$ then $(s, s') \in ATr^{\le n}$ by Eq. (1). Assume
now that $d > 2^{n}$. Then there exists a state t such that t can be reached from s
in $d - 1$ steps and $(t, s') \in Tr$. By induction, we have that $(s, t) \in ATr^{\le n}$ and thus
$(s, s') \in ATr^{\le n} \circ Tr$. By our assumption it follows that $(s, s') \in ATr^{\le n}$.


Note that when a call to IsReachable on line 5 in Algorithm 1 returns $\emptyset$, the
$(n+1)$-st element of the TPA sequence, $ATr^{\le n+1}$, does not relate any initial and bad
state. Thus we can check at this point for the conditions of Lemma 4, and, if
satisfied, we can immediately conclude that no counterexample (of any length) 
exists in the system and report safety. 
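
On a finite system the check suggested by Lemma 4 is a single inclusion test. The following explicit-state sketch assumes relations as sets of pairs; the function names and the two candidate relations are hypothetical, and in the symbolic setting the inclusion would be posed as an SMT query instead.

```python
def compose(r1, r2):
    """Relational composition: pairs (a, c) with a -r1-> b -r2-> c for some b."""
    successors = {}
    for b, c in r2:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in successors.get(b, ())}


def reflexive_transitive_closure(states, tr):
    """Tr*, computed by squaring until a fixpoint is reached."""
    closure = {(s, s) for s in states} | set(tr)
    while True:
        extended = closure | compose(closure, closure)
        if extended == closure:
            return closure
        closure = extended


def lemma4_condition(atr, tr):
    """Sufficient condition of Lemma 4: ATr o Tr or Tr o ATr is contained in ATr."""
    return compose(atr, tr) <= atr or compose(tr, atr) <= atr


# Hypothetical toy system: a counter 0 -> 1 -> ... -> 7.
states = range(8)
tr = {(i, i + 1) for i in range(7)}
tr_star = reflexive_transitive_closure(states, tr)

monotone = {(a, b) for a in states for b in states if a <= b}          # "x' >= x"
bounded = {(a, b) for a in states for b in states if a <= b <= a + 2}  # "x <= x' <= x + 2"
```

Here `monotone` passes the test and indeed contains `tr_star`, while `bounded` satisfies condition (1) for n = 1 only and fails the test.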

In fact, to detect that no counterexample exists, the assumptions of Lemma 4
can be relaxed a bit. We can consider the restriction of these relations to only
initial or bad states. The notation $A \lhd R$ denotes a domain restriction of a binary
relation R to a set A, i.e., $(x, y) \in A \lhd R$ iff $(x, y) \in R \land x \in A$. Similarly $R \rhd B$
denotes the codomain restriction, i.e., $(x, y) \in R \rhd B$ iff $(x, y) \in R \land y \in B$.


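Lemma 5. Assume that for some n, $Init \lhd ATr^{\le n} \circ Tr \subseteq Init \lhd ATr^{\le n}$. Then
$Init \lhd Tr^{*} \subseteq Init \lhd ATr^{\le n}$. Similarly, if $Tr \circ ATr^{\le n} \rhd Bad \subseteq ATr^{\le n} \rhd Bad$, then
$Tr^{*} \rhd Bad \subseteq ATr^{\le n} \rhd Bad$.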


Proof. Same as the proof of Lemma 4, with appropriate restrictions. 


Lemma 5 represents a weaker form of Lemma 4: it has a weaker assumption 
and a weaker conclusion. Nevertheless, the conclusion is still strong enough to 
ensure that no counterexample exists and conclude safety. 
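
A small explicit-state sketch shows a case where the condition of Lemma 4 fails while the restricted condition of Lemma 5 holds. The system, the relation `atr` and the helper names are hypothetical and chosen only to exhibit the difference.

```python
def compose(r1, r2):
    """Relational composition: pairs (a, c) with a -r1-> b -r2-> c for some b."""
    successors = {}
    for b, c in r2:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r1 for c in successors.get(b, ())}


def domain_restrict(a, rel):
    """A <| R: keep only pairs whose first component lies in A."""
    return {(x, y) for x, y in rel if x in a}


def reachable(init, tr):
    """States reachable from init in any number of steps (ground truth for the check)."""
    seen, frontier = set(init), list(init)
    while frontier:
        s = frontier.pop()
        for a, b in tr:
            if a == s and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


# Two disconnected chains: 0 -> 1 -> 2 -> 3 and 4 -> 5 -> 6 -> 7.
states = range(8)
tr = {(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)}
init, bad = {0}, {7}

# Coarse element: "x <= x' <= x + 2" everywhere, plus "x' <= 3" from the initial state.
atr = ({(a, b) for a in states for b in states if a <= b <= a + 2}
       | {(0, b) for b in range(4)})

full_condition = compose(atr, tr) <= atr                      # Lemma 4 (one variant)
init_atr = domain_restrict(init, atr)
restricted_condition = compose(init_atr, tr) <= init_atr      # Lemma 5
relates_init_to_bad = any(s in init and t in bad for s, t in atr)
```

The unrestricted check fails because of pairs such as (4, 6) followed by the step (6, 7), which are irrelevant for runs starting in `init`; the restricted check ignores them and still justifies safety.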


5 Experiments


We have implemented our TPA-based procedure (Algorithm 1) in our new
CHC solver Golem⁷. Golem is built on top of the interpolating SMT solver
OpenSMT [26]. In our experiments we used version 2.2.0 of OpenSMT⁸.

To gauge the feasibility of our algorithm we performed a set of experiments. 
All experiments were conducted on a machine with AMD EPYC 7452 32-core 
processor and 8x32 GiB of memory. We compared our approach to the current 
state-of-the-art tools Eldarica 2.0.6 [25], IC3-IA 20.04.1 [15] and Z3 4.8.12 [39] 
(using both its BMC [9] and Spacer [30] engines), which were the top competitors 
in CHC-COMP 2020 and 2021 [43,21]. We used both versions of our algorithm 
in the experiments: using MBP (TPA-MBP) and QE (TPA-QE). The format 
of all the benchmarks is that of the constrained Horn clauses (CHCs) used in 
the CHC-COMP. Since IC3-IA’s input format differs, all CHC benchmarks were 
translated to VMT format using the automated tool packaged with IC3-IA.⁹

The goal of the first experiment was to investigate the scalability of our 
algorithm with respect to the length of the counterexample and compare its 
performance to the state-of-the-art tools. We used the parametrized transition 
system from our motivating example in Section 3. The counterexample in this 
system has length 2N and we ran the tools on instances for N ranging from 1 
to 511. The timeout was set to 300 seconds. Figure 2 shows the runtime of the 
tools for the given value of N. 
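
The stated counterexample length 2N can be checked by direct simulation. The sketch below assumes that the parametrized system generalizes the N = 3 instance of Example 1 in the obvious way (initial state x = 0, y = N; one step increments x while x < 2N and increments y when the new x exceeds N; bad states satisfy x ≥ 2N and y = 2N). This generalization and the function name are assumptions, since Figure 1 is not reproduced in this section.

```python
def counterexample_length(n_param, max_steps=1_000_000):
    """Number of steps until a bad state is reached, or None if none is reached.

    The assumed system is deterministic, so the first hit is also the shortest one.
    """
    x, y = 0, n_param                          # Init: x = 0 and y = N
    for steps in range(max_steps + 1):
        if x >= 2 * n_param:                   # loop guard x < 2N no longer holds
            return steps if y == 2 * n_param else None
        x += 1                                 # x' = x + 1
        if x > n_param:                        # y' = ite(x' > N, y + 1, y)
            y += 1
    return None


lengths = {n_param: counterexample_length(n_param) for n_param in (1, 3, 100, 511)}
```

For the largest instance of the experiment, N = 511, the counterexample thus has 1022 steps, which a plain unrolling-based search has to cover step by step.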

TPA-MBP was able to report all instances as unsafe, needing less than two
seconds for each instance. Eldarica, IC3-IA and Z3-BMC exhibit a relatively stable
pattern where the performance decreases rapidly with increasing N. Z3-Spacer,
on the other hand, exhibits a curious behaviour where it is able to solve most of
the instances (even though it is slower than TPA-MBP by at least an order of
magnitude), but on a relatively large number of instances it times out, and we
were not able to identify a pattern in the instances on which this happens. A quick
look at the instances for N < 100 suggests that on some instances its behaviour
is much closer to that of IC3-IA. Finally, TPA-QE also shows an interesting
pattern in its runtime where its performance drops considerably on every power
of two, and then it slowly improves for larger N until the next power of two.

This first experiment showed very promising results for TPA-MBP which 
benefited from the fact that the reason why shorter counterexamples do not exist 
can be summarized relatively easily. It scaled exceptionally well compared to the 
state-of-the-art tools, as well as TPA-QE. 

To confirm the results from the first experiment, we continued with the second 
set of benchmarks representing instances of our targeted type of problems. They 
represent assertions over multi-phase loops, which are known to be difficult to 
analyze by state-of-the-art techniques. We took 54 safe multi-phase benchmarks 


7 https://github.com/usi-verification-and-security/golem; commit 4eala53 

8 https://github.com/usi-verification-and-security/opensmt

9 Full results of the experiments available at http://verify.inf.usi.ch/horn-clauses/tpa/experiments.
Artifact available at https://doi.org/10.5281/zenodo.5815911

Fig. 2: Runtime for motivating example for N from 1 to 511 (log y-axis)


from the CHC-COMP repository¹⁰ and then for each benchmark created its unsafe
version with a minor modification of the safety property.¹¹ In most cases this was
done by negating one of the conjuncts of the property. In a few cases this resulted 
in a simple benchmark with a very short CEX (< 10 steps), but in most cases, 
the minimal counterexample is much larger, ranging from a few hundreds to a 
few tens of thousands of steps. There are even a few extremes where the minimal 
counterexample requires hundreds of thousands or even millions of steps. 

With the timeout of 300 seconds, out of 54 benchmarks, TPA-QE solved 20 
and TPA-MBP solved 35 benchmarks, beating the other tools among which Z3- 
Spacer performed the best, solving 20 benchmarks. The results are summarized in 
Figure 3 where the number of solved benchmarks by each tool is plotted against 
the time needed for their solving. 

Overall, our tool solved 15 benchmarks that none of the other tools was able 
to solve and in general could be one or two orders of magnitude faster. There 
were two noticeable exceptions: benchmark 24 was uniquely solved by Z3 and 
benchmark 39 was uniquely solved by IC3-IA (for benchmark numbering, see the 
link in footnote 11). We found out that in the latter case our tool suffered from 
incompleteness in the decision procedure of OpenSMT for integer arithmetic, 
while in the former case the interpolation used by our algorithm was not producing 
good abstractions and we suffered from the need for frequent refinements. 

We also examined the solved benchmarks for the length of the minimal 
counterexample they admit. The results are in line with the observations from our 
first experiments: Other tools could only solve benchmarks with a counterexample 


10 https://github.com/chc-comp/aeval-benchmarks
11 Benchmarks available at https://github.com/blishko/chc-benchmarks.

Fig. 3: Results on 54 multi-phase unsafe benchmarks


of up to a thousand steps (1001 steps in benchmark 17 solved by Z3-Spacer). 
TPA-QE matched this performance (1001 steps in benchmark 27), but TPA-MBP 
managed to solve benchmarks with a counterexample of more than ten thousand 
steps (17650 in benchmark 42). Thus, our technique significantly improves upon 
state-of-the-art with respect to the length of the counterexample it can detect. 

Finally, we successfully tested our implementation on the safe version of 
the 54 multi-phase benchmarks and on the general set of 498 benchmarks from 
CHC-COMP’21, the category of transition systems over linear real arithmetic. 
TPA-MBP managed to prove 10 of the multi-phase benchmarks safe. Z3-Spacer, 
IC3-IA and Eldarica proved safe 9, 20 and 26 of these benchmarks, respectively.
On the CHC-COMP LRA-TS benchmark set, TPA-MBP was able to solve 70 
unsafe benchmarks (from 90+ known unsafe benchmarks in the set) and 67 safe 
benchmarks. 


6 Related work 


Loop acceleration [5,12,22] is a related approach for loop analysis that enables 
both proving safety and detection of deep bugs. It transforms the loop to a single 
quantifier-free formula representing all possible executions of the loop. While 
offering significant improvement for limited types of integer loops, it is not
applicable for code with control-flow divergence and/or data structures.
Acceleration has also been combined with interpolation-based model-checking [13,24]. In
contrast, our technique does not accelerate paths but builds over-approximations 
of bounded number of iterations. It is not restricted to any specific type of loops, 
and it works over any theory supporting interpolation and quantifier elimination. 

Another technique for fast detection of deep counterexamples for C programs 
was proposed in [32]. Given a path through a loop, it computes a new path that 
under-approximates an arbitrary number of iterations of the original path. In
contrast to loop acceleration, this technique only under-approximates the loop 
behaviour, but it can handle conditionals and richer background theories. Our 
approach targets the same goal but it is over-approximating, which allows for 
detecting (transition) invariants and proving safety. Their prototype aims at C 
programs only (and does not seem to be maintained anymore). Our
implementation works on transition systems in the form of constrained Horn clauses (CHC)
and thus is agnostic to the programming language. 

Abstracting the transition relation using interpolation has been employed in [27].
They use interpolation to compute and refine an abstract version of the transition
relation. However, they abstract only a single step of the transition relation.
Instead, we use interpolation to compute relations that over-approximate multiple 
(and increasingly larger number of) steps of the transition relation. 

Transition invariants [40] have been successfully employed for proving liveness 
properties, especially termination [33,41]. Our technique can discover transition 
invariants and use them to prove safety. However, in this paper we focused on
finding counterexamples and the directed search for invariants is left for future work.

Our technique can find a possible application in automating test-case
generation. A given program can be automatically annotated with assertions representing
the reachability of all the branches. Having the goal to detect a set of input values 
for maximizing the test coverage [46], our technique would be called repeatedly 
to find many counterexamples for a subset of assertions (including deep ones) 
and prove the unreachability of the remaining ones. 


7 Conclusion and Future Work 


This paper introduces a novel model-checking algorithm for safety properties 
of transition systems with a focus on finding deep counterexamples. The idea 
is based on maintaining a sequence of transition formulas, called the transition 
power abstraction (TPA) sequence, where each element over-approximates a 
sequence of transition steps twice as long as its predecessor. The sequence is 
used in answering bounded reachability queries, which in turn results in new 
information that further refines the sequence. We proved the correctness of this 
algorithm and showed that it eventually finds a counterexample if one exists, 
assuming the background theory admits interpolation and quantifier elimination. 
For performance reasons, our implementation applies quantifier elimination lazily
using model-based projection, which lets the approach outperform the state of the
art on a class of problems with multi-phase loops. The experiments confirmed
that it is able to detect counterexamples of much greater depth than existing 
tools within the same time constraints. 

As future work, we plan to investigate possible improvements of the algorithm 
and tailor it for finding transition invariants. This would contribute to its ability 
to prove program safety and enable the modular reasoning to support arbitrary
systems of constrained Horn clauses. 

Abstract. Diagnosability is a fundamental problem of partially observable
systems in safety-critical design. Diagnosability verification checks
if the observable part of the system is sufficient to detect some faults. A
counterexample to diagnosability may consist of infinitely many
indistinguishable traces that differ in the occurrence of the fault. When the
system under analysis is modeled as a Büchi automaton or finite-state
Fair Transition System, this problem reduces to looking for ribbon-shaped
paths, i.e., fair paths with a loop in the middle.

In this paper, we propose to solve the problem by extending the liveness-to-safety
approach to look for lasso-shaped paths. The algorithm can be
applied to various diagnosability conditions in a uniform way by changing 
the conditions on the loops. We implemented and evaluated the approach 
on various diagnosability benchmarks. 


Keywords: Diagnosability · Model checking · Liveness to safety


1 Introduction 


The design of fault detection mechanisms is a standard part of the design of 
safety-critical systems. Faults are usually not directly observable. They are
diagnosed by observing a sequence of observations and inferring the value of
unobservable variables based on a system model. A fundamental question for the
design of such partially observable systems is to determine if it is always possible
to detect a fault. Diagnosability verification is the problem of checking whether
the available sensors are sufficient to determine the occurrence of a fault. 

Historically, diagnosability verification is reduced to a model checking
problem looking for a critical pair of indistinguishable traces that differ with respect
to the fault. This pair witnesses the impossibility to detect the fault along such 
sequence of observations. 

When considering fair transition systems, critical pairs are not sufficient and 
it is necessary to look for infinitely many indistinguishable traces. In the case of
finite-state systems, such a set of infinite traces can be represented by ribbon-shaped
paths, i.e., paths with a loop in the middle. Previous solutions, hinted at
in [16], were based on either bounded model checking, and so not able to prove
diagnosability (absence of the critical ribbon-shaped paths), or BDD-based fixpoint
computation, which suffers from the problem of precomputing the fair states.

In this paper, we propose a new approach based on the liveness-to-safety 
construction [3], where the search for a (single) lasso-shaped path is reduced
to an invariant property. Like in liveness-to-safety, we use additional variables 
to guess the loopback states, which in the case of ribbon-shaped paths are used 
twice, the first time for the loop in the middle, the second time for the final lasso. 
Additional constraints are added to encode the looping conditions that must hold 
in the two loops for encoding the diagnosability problem. The algorithm can 
be applied to various diagnosability conditions in a uniform way by changing 
the conditions on the loops. We implemented and evaluated the approach on 
various diagnosability benchmarks. Different algorithms have been tested to solve the
resulting invariant model checking problem, showing better performance with
respect to the fixpoint-based approach.

The main contribution of the paper is the extension of liveness-to-safety
to generate an infinite number of traces. The set is in the form of a ribbon
shape (in other words, in the form $a; b^{*}; c; d^{\omega}$) and may have applications beyond
the diagnosability problem, e.g., to solve non-interference problems requiring
infinitely many different traces [13] or to counterexample-guided abstraction
refinement.

The rest of the paper is organized as follows. In Section 2, we give an overview 
of related work. Section 3 defines the necessary formal background. The main 
problem along with the original solution is presented in Section 4. Our main 
contribution is introduced in Section 5, where we present the novel solution and 
prove its correctness. Section 6 contains the experimental evaluation comparing 
our solution with the original one. Finally, in Section 7 we give conclusions and 
directions for future work. 


2 Related Work 


The problem of diagnosability [17] refers to the possibility of inferring some 
desired information (e.g., the occurrence of a fault) during the execution of 
a system, in a partially observable environment. Hence, diagnosability can be 
phrased using hyperproperties, namely as a property of the traces representing 
the execution of the system [5,16]. 

In [16] it has been shown that the problem of diagnosability under fairness 
can be reduced to the search for ribbon-shaped paths, i.e. paths with a loop in 
the middle, where specific conditions on the occurrence of faults are imposed. 
Historically, diagnosability has been defined in the context of Discrete-Event 
Systems [17], without taking fairness into account. In [14] fairness is considered 
only in the context of live systems, i.e. under the hypothesis that every finite trace 
can be extended to an infinite fair trace, and fair diagnosability is introduced 
only informally. In this context, our ribbon-shaped fair critical pair corresponds
to the critical pair of [14], where the faulty trace must be fair while the nominal
trace may be unfair.

A construction similar to ribbon-shaped paths, called doubly pumped lasso, 
is used in [13] as a building block to address the problem of model checking 
a class of quantitative hyperproperties, as in the problem of quantitative non- 
interference (i.e., bound the amount of information about some secret inputs 
that may be leaked through the observable outputs of the system). 

In [13] the problem of verifying quantitative hyperproperties is addressed using
a model checking algorithm based on model counting, which is shown to have
a better complexity than using a HyperLTL model checker, and a Max#SAT-based
implementation. In [16], the authors address the problem of checking
diagnosability using an extension of the classical twin-plant construction [15] and an
LTL model checker. The approach we use in this paper builds upon the approach
of [16], but uses an extension of the liveness-to-safety approach [3] instead. The
extension omits the computation of fair states and keeps the representation of
the system symbolic, which is more space efficient. The problem is reduced to a
reachability problem, which is well studied, so we can take advantage
of already developed algorithms for checking reachability.


3 Background 


3.1 Symbolic Fair Transition Systems 


The plant under analysis is represented as a finite-state symbolic fair transition
system (SFTS). An SFTS is a tuple $\langle V, I, T, F \rangle$, where $V$ is a finite set of
Boolean state variables; $I$ is a formula over $V$ defining the initial states; $T$ is
a formula over $V, V'$ (with $V'$ being the next version of the state variables)
defining the transition relation; and $F$ is a set of formulas over $V$ defining the
fairness conditions. If $F = \emptyset$, we call it a symbolic transition system (STS) and
write $\langle V, I, T \rangle$.

We remark that the choice of representing the plant in form of an SFTS 
does not restrict the generality of the framework. In fact, it is possible to encode 
labeled transition systems and discrete event systems. 

A state $s$ is an assignment to the state variables $V$. We denote with $s'$ the
corresponding assignment to $V'$. Given an assignment to a set $V$ of Boolean
variables, we also represent the assignment as the set of variables that are
assigned to true. Given a state $s$ and a subset of variables $U$, we denote with $s_{|U}$
the restriction of $s$ to the variables in $U$.

In the following we assume that an SFTS $P = \langle V, I, T, F \rangle$ is given.

Given a sequence of states $\sigma$, we denote with $\sigma^{k}$ the sequence obtained by
repeating $\sigma$ for $k$ times, and with $\sigma^{\omega}$ the sequence obtained by repeating $\sigma$ for an
infinite number of times.

Given a state $s_0$ of $P$, a trace of $P$ starting from $s_0$ is an infinite sequence
$\pi = s_0, s_1, s_2, \ldots$ of states starting from $s_0$ such that, for each $k \geq 0$, $(s_k, s_{k+1})$
satisfies $T$, and for all $f \in F$, for infinitely many $i \geq 0$, the formula $f$ is true in
$s_i$. If $s_0$ is initial, i.e., it satisfies $I$, then we say that $\pi$ is a trace of $P$. We write
$\Pi_P$ for the set of traces of $P$.

We denote with $\pi[k]$ the $(k+1)$-th state of $\pi$. We say that $s$ is reachable (in
$k$ steps) in $P$ iff there exists a sequence $\pi = s_0 s_1 \ldots s_k$, where $s_k = s$, $s_0$ satisfies
$I$ and every $(s_i, s_{i+1})$ satisfies $T$. A state $s$ is fair if there exists a trace starting
from $s$.

Given a trace $\pi = s_0, s_1, s_2, \ldots$ and a subset of variables $U \subseteq V$, we denote
by $\pi_{|U} = s_{0|U}, s_{1|U}, s_{2|U}, \ldots$ the projection over the variables in $U$.

Let $S^1 = \langle V^1, I^1, T^1, F^1 \rangle$ and $S^2 = \langle V^2, I^2, T^2, F^2 \rangle$ be two SFTSs. We define
the synchronous product $S^1 \times S^2$ as the SFTS $\langle V^1 \cup V^2, I^1 \wedge I^2, T^1 \wedge T^2, F^1 \cup F^2 \rangle$.
Every state $s$ of $S^1 \times S^2$ is an assignment to the two sets of state variables $V^1$
and $V^2$ such that $s^1 = s_{|V^1}$ is a state of $S^1$ and $s^2 = s_{|V^2}$ is a state of $S^2$.

Let $p$ be a propositional formula over $V$. We write $s \models p$ iff $s$ satisfies $p$, and
$\pi, i \models p$ iff $\pi[i]$ satisfies $p$. We write $P \models p$ iff for all reachable $s$ in $P$ it holds
that $s \models p$. Let $\varphi$ be a formula over an infinite trace expressed in LTL [12]. We
write $\pi \models \varphi$ iff $\varphi$ is true on the trace $\pi$. We write $P \models \varphi$ iff for all traces $\pi$ in
$\Pi_P$ it holds that $\pi \models \varphi$.

In the rest of the presentation, we sometimes use a context, which we express
as an LTL formula $\Psi$, to restrict the set of traces of the plant. This is useful to
address the problem of diagnosability under assumptions. Note that, since our
framework supports plants with fairness constraints, the incorporation of the
context can be done (see, e.g., [10]) by converting the context into an SFTS $S_{\Psi}$
(representing the monitor automaton for the LTL formula) and replacing the
plant $P$ with $P \times S_{\Psi}$ (the synchronous product of the plant with the monitor
automaton).


The Twin Plant Construction The twin plant construction of a plant $P$ over
a subset $Y \subseteq V$ of variables (the observable variables), denoted $\mathrm{TWIN}(P, Y)$ and
originally proposed by [15], is based on two copies of $P$, such that a trace in
the twin plant corresponds to a pair of traces of $P$. In the security domain, two
copies of a system used for verification are known as a self-composition [2].

The twin plant can be defined as the synchronous product of two copies of
the SFTS corresponding to the plant. Formally, given a plant $P = \langle V, I, T, F \rangle$,
we denote with $P_L = \langle V_L, I_L, T_L, F_L \rangle$ and $P_R = \langle V_R, I_R, T_R, F_R \rangle$ the ('left' and
'right') copies of $P$, obtained by renaming each variable $v$ as $v_L$ or $v_R$,
respectively (i.e., if $O \in \{L, R\}$, then $V_O$ stands for the set of variables $\{v_O \mid v \in V\}$).
Moreover, we define a formula OBSEQ stating that the sets of observable
variables of the two copies are equal at the given point. The twin plant of $P$ is
defined as follows.


Definition 1 (Twin Plant). Given a set of variables $Y \subseteq V$, the twin plant
of $P = \langle V, I, T, F \rangle$ is the SFTS $\mathrm{TWIN}(P, Y) = P_L \times P_R$. Moreover, we define
the formula $\mathrm{OBSEQ} = \bigwedge_{v \in Y} v_L = v_R$.


There is a one-to-one correspondence between $\Pi_P \times \Pi_P$ (pairs of traces of $P$)
and $\Pi_{\mathrm{TWIN}(P,Y)}$ (traces of $\mathrm{TWIN}(P, Y)$). A trace of $\mathrm{TWIN}(P, Y)$,
$\pi = (s_{0,L}, s_{0,R}), (s_{1,L}, s_{1,R}), \ldots$, can be decomposed into two traces of $P$:
$\mathrm{Left}(\pi) = s_{0,L}, s_{1,L}, \ldots$ and $\mathrm{Right}(\pi) = s_{0,R}, s_{1,R}, \ldots$.
Conversely, given two traces $\pi_L$ and $\pi_R$ in $\Pi_P$,
there is a corresponding trace in $\Pi_{\mathrm{TWIN}(P,Y)}$, denoted by $\pi_L \times \pi_R$.
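
As an illustration only, the construction can be sketched on an explicit-state plant, with states enumerated rather than represented symbolically as in the paper. The function name `twin_plant`, the dictionary-based transition map and the state names in the usage note are assumptions made for this sketch; fairness conditions are omitted for brevity.

```python
from itertools import product

def twin_plant(init, trans, obs):
    """Explicit-state sketch of the twin plant (hypothetical helper).

    init  : set of initial plant states
    trans : dict mapping a plant state to the set of its successors
    obs   : function returning the observable part of a plant state
    Returns the initial pairs, the successor map over pairs, and a
    predicate playing the role of OBSEQ on a pair of states.
    """
    def obseq(pair):
        left, right = pair
        return obs(left) == obs(right)

    states = set(init) | set(trans) | {t for succ in trans.values() for t in succ}
    # Synchronous product: both copies move at the same time.
    twin_init = set(product(init, init))
    twin_trans = {
        (l, r): {(l2, r2) for l2 in trans.get(l, ()) for r2 in trans.get(r, ())}
        for l, r in product(states, states)
    }
    return twin_init, twin_trans, obseq
```

For instance, with hypothetical states such as `"ok_off"`, `"ok_on"` and `"ko_off"` and `obs` returning the part after the underscore, the pair `("ok_off", "ko_off")` satisfies the OBSEQ predicate while `("ok_on", "ko_off")` does not.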


3.2 Liveness to Safety (L2S). 


The liveness-to-safety reduction (L2S) [3] is a technique for reducing an LTL
model checking problem on a finite-state transition system to an invariant model
checking problem. The idea is to encode the absence of a lasso-shaped path
violating the LTL property $FG \neg f$ as an invariant property.

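The encoding is achieved by transforming the original transition system $S$ to
the transition system $S_{L2S}$, introducing a set $\overline{X}$ of variables containing a copy $\overline{x}$
for each state variable $x$ of the original system, plus additional variables seen,
triggered and loop. Let $S = \langle X, I, T \rangle$. L2S transforms the transition system in
$S_{L2S} = \langle X_{L2S}, I_{L2S}, T_{L2S} \rangle$ so that $S \models FG \neg f$ if and only if $S_{L2S} \models \neg bad_{L2S}$,
where:


\begin{align*}
X_{L2S} &= X \cup \overline{X} \cup \{seen, triggered, loop\} \\
I_{L2S} &= I \wedge \neg seen \wedge \neg triggered \wedge \neg loop \\
T_{L2S} &= T \wedge \Big[\bigwedge_{X} \overline{x} \iff \overline{x}'\Big] \\
        &\quad \wedge \Big[seen' \iff \big(seen \vee \bigwedge_{X} (x \iff \overline{x})\big)\Big] \\
        &\quad \wedge \big[triggered' \iff (triggered \vee (f \wedge seen'))\big] \\
        &\quad \wedge \Big[loop' \iff \big(triggered' \wedge \bigwedge_{X} (x' \iff \overline{x}')\big)\Big] \\
bad_{L2S} &= loop
\end{align*}


The variables $\overline{X}$ are used to non-deterministically guess a state of the system
from which a reachable fair loop starts. The additional variables are used to
remember that the guessed state was seen once and that the signal $f$ was true
at least once afterwards.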
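
The role of the three flags can be illustrated with a small explicit-state sketch. It is not the symbolic encoding used in the paper: states are enumerated, the guess of the saved state is made by trying every candidate, and the function name `l2s_violates_fg_not_f` is an assumption of this example.

```python
from collections import deque

def l2s_violates_fg_not_f(init, trans, f):
    """Return True iff some reachable loop contains a state where f holds,
    i.e. iff FG(not f) is violated on the finite system (illustrative sketch).

    init  : set of initial states
    trans : dict mapping a state to the set of its successors
    f     : predicate on states
    """
    states = set(init) | set(trans) | {t for succ in trans.values() for t in succ}
    for saved in states:                      # guess of the loopback state
        start = [(s, False, False, False) for s in init]
        queue, visited = deque(start), set(start)
        while queue:
            x, seen, triggered, loop = queue.popleft()
            if loop:                          # the bad state of the reduction
                return True
            seen2 = seen or x == saved        # seen' <-> seen or (x = saved)
            triggered2 = triggered or (f(x) and seen2)
            for x2 in trans.get(x, ()):
                # loop' <-> triggered' and (x' = saved)
                nxt = (x2, seen2, triggered2, triggered2 and x2 == saved)
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
    return False
```

On a two-state cycle where `f` holds in one of the states the function reports a violation; on a system whose only loop avoids `f` it does not.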


4 The Problem of Ribbon-Shaped Paths 


4.1 The Diagnosability Problem 


The observable part $obs(s)$ of a state $s$ is the projection of $s$ on the subset
$Y$ of observable state variables. Thus, $obs(s) = s_{|Y}$. The observable part of $\pi$
is $obs(\pi) = obs(s_0), obs(s_1), obs(s_2), \ldots = \pi_{|Y}$. Given two traces $\pi_1$ and $\pi_2$, we
denote by $\mathrm{OBSEQUPTO}(\pi_1, \pi_2, k)$ the condition saying that, for all $i$, $0 \leq i \leq k$,
$obs(\pi_1[i]) = obs(\pi_2[i])$.

Let $\beta$ be a formula over $V$ representing the fault condition to be diagnosed.
We call $\beta$ a diagnosis condition. A system is diagnosable for $\beta$ if there exists
a bound $d$ such that after the occurrence of $\beta$, an observer can infer within $d$
steps that $\beta$ indeed occurred. This means that any other trace with the same
observable part contains $\beta$ as well. Formally, it was first defined in [17] as follows.


Fig. 1: The light bulb example and an example of a ribbon-shaped critical pair
in the light bulb.


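Definition 2 (Diagnosability). Let $P$ be a plant and $\beta$ a diagnosis condition.
$P$ is diagnosable for $\beta$ iff there exists $d \geq 0$ such that for every trace $\pi_1$ and
index $i \geq 0$ such that $\pi_1, i \models \beta$, it holds:


\[
\exists j \in \mathbb{N} \cdot i \leq j \leq i + d \cdot \big(\forall \pi_2 \cdot \mathrm{OBSEQUPTO}(\pi_1, \pi_2, j) \Rightarrow \exists k \in \mathbb{N} \cdot k \leq j \cdot \pi_2, k \models \beta\big).
\]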


The above definition requires a global bound, while when considering fair
transition systems it is possible that the occurrence of $\beta$ can be inferred
eventually, but without a fixed bound. That is the motivation for extending the definition
to fair diagnosability [16].


Definition 3 (Fair Diagnosability). Let $P$ be a plant and $\beta$ a diagnosis
condition. $P$ is fair-diagnosable for $\beta$ iff for every trace $\pi_1$, there exists $d \geq 0$ such
that for every index $i \geq 0$ such that $\pi_1, i \models \beta$, it holds:


\[
\exists j \in \mathbb{N} \cdot i \leq j \leq i + d \cdot \big(\forall \pi_2 \cdot \mathrm{OBSEQUPTO}(\pi_1, \pi_2, j) \Rightarrow \exists k \in \mathbb{N} \cdot k \leq j \cdot \pi_2, k \models \beta\big).
\]


Example 1. Consider the state machine of a light bulb as shown in Figure 1a,
with the observable value OFF/ON and the diagnosis condition $\beta = KO$. Consider
the following context: $G(KO \to F\, OFF) \wedge G(OK \to F\, ON)$. Intuitively, the
LTL formula states that globally a state where KO holds is followed eventually
by a state where OFF holds, and similarly a state where OK holds is followed
eventually by a state where ON holds. Therefore, if the execution reaches KO, it
will eventually go into state OFF/KO and remain there forever. If an execution
is instead always OK, then it will visit infinitely often the state ON/OK. We
can prove that condition $\beta$ is not fair-diagnosable according to Def. 3. In fact, for
every $j$, there exists a trace without $\beta$ that is observationally equivalent up to
$j$ to the trace with $\beta$. Notice how the fairness condition causes the observations
after a failure to always diverge eventually, but that this event can be delayed
indefinitely.


4.2 Ribbon-Shaped Critical Pairs 


Figure 2 illustrates the concept of ribbon-shaped paths. The formal definition is
as follows.


Fig. 2: Ribbon-shaped critical pair


Definition 4 (Ribbon-Shaped Critical Pairs (RCP)). Let $P$ be a plant and
$\beta$ a diagnosis condition. We say that $\pi_1, \pi_2 \in \Pi_P$ are a ribbon-shaped critical
pair for the diagnosability of $\beta$ iff there exist $k, l$ such that $0 \leq k < l$ and:


1. $\pi_1[l] = \pi_1[k]$ and $\pi_2[l] = \pi_2[k]$;
2. $\mathrm{OBSEQUPTO}(\pi_1, \pi_2, l)$;
3. $\pi_1, i \models \beta$ for some $i$, $0 \leq i \leq l$;
4. $\pi_2, i \not\models \beta$ for all $i$, $0 \leq i \leq l$.
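
The four conditions can be checked mechanically on finite prefixes of two candidate traces. The sketch below assumes that the index ranges in the conditions are inclusive, that the prefixes are given as Python lists longer than `l`, and that `obs` and `beta` are supplied as functions; the names `obs_eq_up_to` and `is_rcp_prefix` are hypothetical. It does not check that the prefixes extend to traces of the plant.

```python
def obs_eq_up_to(p1, p2, k, obs):
    """OBSEQUPTO on prefixes: the observations agree at every index 0..k."""
    return all(obs(p1[i]) == obs(p2[i]) for i in range(k + 1))

def is_rcp_prefix(p1, p2, k, l, obs, beta):
    """Check conditions 1-4 of the RCP definition on two finite prefixes."""
    assert 0 <= k < l < min(len(p1), len(p2))
    loops_close = p1[l] == p1[k] and p2[l] == p2[k]               # condition 1
    same_obs = obs_eq_up_to(p1, p2, l, obs)                        # condition 2
    fault_left = any(beta(p1[i]) for i in range(l + 1))            # condition 3
    no_fault_right = not any(beta(p2[i]) for i in range(l + 1))    # condition 4
    return loops_close and same_obs and fault_left and no_fault_right
```

With states written as hypothetical pairs such as `("ko", "off")`, `obs` returning the second component and `beta` testing whether the first component is `"ko"`, a faulty prefix that stays dark next to a nominal prefix that also stays dark passes the check with `k = 1` and `l = 2`.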


For fair diagnosability, the definition is similar: 


Definition 5 (Ribbon-Shaped Fair Critical Pairs (RFCP)). Let $P$ be a
plant and $\beta$ a diagnosis condition. We say that $\pi_1, \pi_2 \in \Pi_P$ are a ribbon-shaped
fair critical pair for the diagnosability of $\beta$ iff there exist $k, l$ such that $0 \leq k < l$
and:


1. $\pi_1[l] = \pi_1[k]$ and $\pi_2[l] = \pi_2[k]$;
2. $\pi_1$ is in the form $s_0, s_1, \ldots, s_k, (s_{k+1}, \ldots, s_l)^{\omega}$;
3. $\mathrm{OBSEQUPTO}(\pi_1, \pi_2, l)$;
4. $\pi_1, i \models \beta$ for some $i \geq 0$;
5. $\pi_2, i \not\models \beta$ for all $i$, $0 \leq i \leq l$.


In this paper, we use a slightly different definition than the one given in [16].
Definition 5 includes an additional constraint on $\pi_1$ by requiring a loop shape.
However, these two definitions are equivalent and the proof of this can be found
in the extended version of the paper.


Example 2. Fig. 1b shows an example of a ribbon-shaped critical pair for the 
light bulb of Example 1. 


We can prove that, in the general case, $\beta$ is not diagnosable if and only if there
exists a ribbon-shaped critical pair. In other words, ribbon-shaped critical pairs
are necessary and sufficient for diagnosability violation. The following theorem
is adapted and extended from [16] and can be proved in a similar way.


Theorem 1 (RCP necessary and sufficient for diagnosability). Let $P$
be a plant. $P$ is not diagnosable for $\beta$ iff there exists a ribbon-shaped critical
pair for the diagnosability of $\beta$. $P$ is not fair diagnosable for $\beta$ iff there exists a
ribbon-shaped fair critical pair for the fair diagnosability of $\beta$.


The proof can be done similarly as in [16]. The theorem in [16] is proved for 
asynchronous systems while here we assume that the plants in the twin plant 
are synchronized on the observable part. 


4.3 Fixpoint-based Algorithm 


The ribbon-shaped structure requires eventually reaching a loop (the ribbon),
from which it is possible to branch with a fair suffix (the final lasso). Therefore,
it combines path and branching conditions, and can be encoded into a CTL* 
formula [16] over variables of the twin plant. We can verify whether the formula 
holds in the twin plant using a fixpoint-based algorithm. Actually, the specific 
structure allows for a simple implementation on top of standard BDD-based 
model checking [16]: it is sufficient first to compute the set of fair states, then 
to compute the set of fair states staying forever in the looping condition, and 
finally to look for an initial state reaching such loop. 

The main issue of this approach is the computation of fair states, which is 
performed independently from the diagnosis condition and may be a bottleneck 
in case of complex fairness conditions. 
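
To make the first step concrete, the following sketch computes the set of fair states with a nested fixpoint in the style of Emerson-Lei, on an explicit graph. The implementation discussed in this paper is BDD-based and symbolic; the set-based version, the function names `pre` and `fair_states`, and the representation of each fairness condition as a set of states are assumptions made for illustration.

```python
def pre(trans, target):
    """States with at least one successor in `target`."""
    return {s for s, succ in trans.items() if succ & target}

def fair_states(trans, fairness):
    """Greatest fixpoint Z such that, for every fairness set f, each state
    of Z can reach a state of Z & f in at least one step.

    trans    : dict mapping a state to the set of its successors
    fairness : list of sets of states, one per fairness condition
    """
    states = set(trans) | {t for succ in trans.values() for t in succ}
    z = set(states)
    while True:
        new_z = set(z)
        for f in fairness:
            reach = z & f                  # least fixpoint: backward reachability
            while True:
                grown = reach | pre(trans, reach)
                if grown == reach:
                    break
                reach = grown
            new_z &= pre(trans, reach)
        if not fairness:                   # no fairness: states with an infinite path
            new_z &= pre(trans, z)
        if new_z == z:
            return z
        z = new_z
```

The outer loop is executed regardless of the diagnosis condition, which is the cost the paragraph above refers to.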


5 Extended Liveness to Safety 


In this section, we propose a novel algorithm for finding RCPs and RFCPs in fair 
symbolic transition systems. The algorithm extends L2S such that it searches
for two consecutive loops instead of one. We define a ribbon structure, which is
constructed from the twin plant. The ribbon structure is parametrized, thus it 
can be used for finding both RCPs and RFCPs with only a slight modification. 
We prove that a certain state is reachable in the ribbon structure if and only if 
there exists an RCP/RFCP in the original structure. 

The ribbon structure extends the twin plant of the original structure with a 
new copy of state variables, new flags, and new transitions that constrain the 
behaviour of the new variables and the flags. In the following, we describe how 
the twin plant is extended and we formally define the ribbon structure. 


5.1 Definition of the L2S Extension 


The ribbon structure is parametrized by an SFTS $P$, two propositional formulas
$p$ and $q$, and two sets of propositional fairness conditions $F_1$ and $F_2$. These
parameters are later instantiated depending on the specific ribbon-shaped path
that is considered. In particular, $p$ represents the diagnosis condition $\beta$ in the
left copy of the twin plant and $q$ represents the negation of $\beta$ in the right copy
conjoined with the constraint to force the same observations on the two copies.

The ribbon structure $P_{\sim}$ and the propositional formula $\varphi_{\sim}$ are defined such
that any path $\rho$ of $P_{\sim}$ on which $\varphi_{\sim}$ is reached satisfies the following conditions:


- $\rho$ contains two consecutive loops $L_1$ and $L_2$ that satisfy fairness conditions
$F_1$ and $F_2$ respectively;

- $p$ is satisfied in some state of $\rho$ before the end of the first loop;

- $q$ is satisfied in all states of $\rho$ before the end of the first loop.


In the rest of this section, we formally define the set of variables, the initial 
formula and the transition formula using the parameters described above. 


Variables Similarly to the original liveness-to-safety reduction, we create a
copy of all state variables of the twin plant. The copy variables serve as a guess
of the state representing a loopback. The variables are denoted by an overline and
defined as $\overline{V} = \{\overline{v} \mid v \in V\}$, where $V$ is the set of variables of the twin plant
$P$. The variables are reused both for the first and the second loop, where the
second loop is a fair loop.

The flags are auxiliary variables used to monitor whether a loop was found
and whether all loop conditions were satisfied. The set of flags is defined as
$V_m = \{m_{seen}, m_{L_1}, m_p, m_q\} \cup \bigcup_{f_i \in F_1} m_{1,i} \cup \bigcup_{f_i \in F_2} m_{2,i}$. The intuition
behind each flag is as follows:


- $m_{seen}$ is true $\iff$ the loopback (either the first or the second one) was already
seen and is saved in $\overline{V}$;
- $m_{L_1}$ is true $\iff$ the first loop was already found;
- $m_p$ is true $\iff$ $p$ was true;
- $m_q$ is true $\iff$ $q$ was true in all previous states;
- $m_{1,i}$ is true $\iff$ $f_i \in F_1$ was true in the first loop;
- $m_{2,i}$ is true $\iff$ $f_i \in F_2$ was true in the second loop.


In addition, when $m_{seen}$ is true, the current state is in a loop. If $m_{L_1}$ is false,
it is in the first loop. Otherwise, the first loop was already found and the current
state is in the second loop.


Auxiliary Formula The following formula $\varphi_{L_1}$ states the requirements for finding
the loopback of the first loop $L_1$. We need that the conditions on $p$ and $q$
are satisfied and that $L_1$ has not yet been found. In addition, we need that the
fairness conditions were true and that the current state is the same as the guessed
loopback.


\[
\varphi_{L_1} := m_p \wedge m_q \wedge \neg m_{L_1} \wedge m_{seen} \wedge \bigwedge_{f_i \in F_1} m_{1,i} \wedge \bigwedge_{v \in V} v = \overline{v}
\]


Initial Formula All flags besides $m_q$ are initialized to false; $m_q$ is initialized
to true:


\[
\neg m_{seen} \wedge \neg m_{L_1} \wedge \neg m_p \wedge m_q \wedge \bigwedge_{f_i \in F_1} \neg m_{1,i} \wedge \bigwedge_{f_i \in F_2} \neg m_{2,i} \tag{I1}
\]


Transition Formulas We define transitions (T1)-(T8) to ensure the correct
behaviour of the introduced variables such that the conditions mentioned above
are satisfied. The transitions and their intuitive descriptions are as follows.


- Anytime $m_{seen}$ is set to true, in the next state the copied variables are set
to the state variables of the current state:

\[
\neg m_{seen} \wedge m_{seen}' \Rightarrow \bigwedge_{v \in V} \overline{v}' = v \tag{T1}
\]

- If $m_{seen}$ is true, the values of the copy variables are preserved also in the
next state:

\[
m_{seen} \Rightarrow \bigwedge_{v \in V} \overline{v}' = \overline{v} \tag{T2}
\]

- The flags $m_{j,i}$ can change to true only when $f_i \in F_j$ is true and the current
state is in $L_j$:

\[
\bigwedge_{f_i \in F_1} \big( (m_{1,i} = m_{1,i}') \vee (m_{1,i}' \wedge m_{seen} \wedge \neg m_{L_1} \wedge f_i) \big) \tag{T3}
\]

\[
\bigwedge_{f_i \in F_2} \big( (m_{2,i} = m_{2,i}') \vee (m_{2,i}' \wedge m_{seen} \wedge m_{L_1} \wedge f_i) \big) \tag{T4}
\]

- $m_{L_1}$ can change to true only when the first loop was found, as specified
above by $\varphi_{L_1}$, and it forces $m_{seen}$ to be set to false:

\[
(m_{L_1} = m_{L_1}') \vee (\varphi_{L_1} \wedge \neg m_{seen}' \wedge m_{L_1}') \tag{T5}
\]

- $m_{seen}$ can change to false only when $L_1$ was just found:

\[
m_{seen} \Rightarrow \big(m_{seen}' \vee (m_{seen} \wedge \neg m_{L_1} \wedge m_{L_1}')\big) \tag{T6}
\]

- $m_p$ can change to true only when $p$ is true:

\[
(m_p = m_p') \vee (p \wedge m_p') \tag{T7}
\]

- Anytime $q$ is false, $m_q$ goes to false and stays false:

\[
(\neg m_q \vee \neg q) \Rightarrow \neg m_q' \tag{T8}
\]


Note that the transitions (T3), (T4), (T5) and (T7) imply that the flags $m_{j,i}$,
$m_{L_1}$, $m_p$ can change their value from false to true only once and then they stay
true. The transition (T8) implies that $m_q$ can change its value from true to
false only once and then it stays false. Finally, (T6) implies that $m_{seen}$ is set to
false exactly once, when $L_1$ is found, and thus set to true exactly twice, when a
loopback of either $L_1$ or $L_2$ is guessed.




Ribbon Structure Putting together the variables and formulas defined above,
we give the following definition of the ribbon structure.


Definition 6. For the plant $P = \langle V, I, T, F \rangle$, the propositional formulas $p$, $q$ and
the sets of propositional formulas $F_1$, $F_2$ over $V$, let $\langle V_{\sim}, I_{\sim}, T_{\sim} \rangle$ be a symbolic
transition system where:


- $V_{\sim} = V \cup \overline{V} \cup V_m$;
- $I_{\sim} = I \wedge (I1)$;
- $T_{\sim} = T \wedge (T1) \wedge (T2) \wedge (T3) \wedge (T4) \wedge (T5) \wedge (T6) \wedge (T7) \wedge (T8)$.


We call this STS a ribbon structure and denote it by $\mathrm{RIBBON}(P, p, q, F_1, F_2)$.


To finish the reduction, we define the reachability condition. Intuitively, the
condition should express that the second loop was found. This means that the
first loop was already found, all fairness conditions in $F_2$ were true and the
current state is the same as the guessed loopback:


\[
\varphi_{\sim} := m_{L_1} \wedge m_{seen} \wedge \bigwedge_{f_i \in F_2} m_{2,i} \wedge \bigwedge_{v \in V} v = \overline{v}.
\]


In the next section, we show how the reachability in a ribbon structure is 
used to find RCPs and RFCPs and we prove that our construction is correct. 
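
Before turning to correctness, the interplay of the flags can be illustrated with an explicit-state search. This is only a sketch under stated assumptions: states are enumerated instead of encoded symbolically, the monotone flags are updated eagerly (the transition formulas also allow delaying them, which does not change reachability), and the function name `ribbon_reachable` and its argument layout are hypothetical.

```python
from collections import deque

def ribbon_reachable(init, trans, p, q, fair1, fair2):
    """Search for a configuration in which the second loop is closed.

    init, trans  : initial states and successor map of the (twin) plant
    p, q         : predicates on states (the parameters p and q)
    fair1, fair2 : lists of predicates, fairness conditions of the two loops
    A configuration is (x, saved, seen, l1, mp, mq, m1, m2).
    """
    def first_loop_closed(c):                  # auxiliary formula for L1
        x, saved, seen, l1, mp, mq, m1, m2 = c
        return mp and mq and not l1 and seen and all(m1) and x == saved

    def second_loop_closed(c):                 # reachability condition
        x, saved, seen, l1, mp, mq, m1, m2 = c
        return l1 and seen and all(m2) and x == saved

    def successors(c):
        x, saved, seen, l1, mp, mq, m1, m2 = c
        mp2 = mp or p(x)                       # (T7), set eagerly
        mq2 = mq and q(x)                      # (T8)
        m1b = tuple(m or (seen and not l1 and f(x)) for m, f in zip(m1, fair1))
        m2b = tuple(m or (seen and l1 and f(x)) for m, f in zip(m2, fair2))
        for x2 in trans.get(x, ()):
            if seen:
                yield (x2, saved, True, l1, mp2, mq2, m1b, m2b)    # (T2)
                if first_loop_closed(c):                            # (T5), (T6)
                    yield (x2, None, False, True, mp2, mq2, m1b, m2b)
            else:
                yield (x2, None, False, l1, mp2, mq2, m1b, m2b)    # no guess yet
                yield (x2, x, True, l1, mp2, mq2, m1b, m2b)        # (T1): guess

    start = [(s, None, False, False, False, True,
              (False,) * len(fair1), (False,) * len(fair2)) for s in init]
    queue, visited = deque(start), set(start)
    while queue:
        c = queue.popleft()
        if second_loop_closed(c):
            return True
        for n in successors(c):
            if n not in visited:
                visited.add(n)
                queue.append(n)
    return False
```

The search space is finite because every component of a configuration ranges over a finite set, so the breadth-first exploration terminates. In practice the paper delegates this reachability question to symbolic engines rather than enumerating configurations.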


5.2 Correctness 


The ribbon structure and the reachability condition are defined such that any
trace reaching the condition contains two consecutive loops. The definitions of RCP and
RFCP describe only the first loop. Not all critical pairs contain the second
loop. However, using the following propositions, we claim that the existence of 
a critical pair implies existence of a critical pair with two loops, where the first 
loop is as in the original pair and the second loop is fair. This fact is necessary 
to prove that if P contains a critical pair, we can find a critical pair with two 
loops in the ribbon structure. 


Proposition 1. Let $\pi$ be a trace of an SFTS $P$. Then, any prefix of $\pi$ can be
extended to a trace $\pi_F$ that ends with a fair loop.


Proposition 2. Let $\pi_1 = s_1, s_2, s_3, \ldots$ and $\pi_2 = t_1, t_2, t_3, \ldots$ be traces of an SFTS $P$
that end with a fair loop. Then, the path $(s_1, t_1), (s_2, t_2), (s_3, t_3), \ldots$ is a trace of
$P \times P$ that ends with a fair loop.


The first proposition is true because we consider only finite systems. In a
finite system, any infinite fair suffix contains a state that is repeated infinitely
many times. Thus, there must be two occurrences of the state in between which
all fairness conditions are true at least once. The second proposition is true
because we can unroll the fair loops of $\pi_1$ and $\pi_2$ until both of them are in their loop
and then we match the period of the new fair loop in $\pi_1 \times \pi_2$ by taking the least
common multiple of the periods of the fair loops.
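
The argument for the second proposition can be made concrete on lasso-shaped traces given as a finite prefix followed by a repeated loop. The representation as Python lists and the function name `lasso_product` are assumptions of this sketch.

```python
from math import gcd

def lasso_product(prefix1, loop1, prefix2, loop2):
    """Pair two lassos prefix.(loop)^omega position by position.

    The product prefix is as long as the longer prefix, so that both traces
    are inside their loops afterwards; the product loop has the least common
    multiple of the two periods.
    """
    n = max(len(prefix1), len(prefix2))
    period = len(loop1) * len(loop2) // gcd(len(loop1), len(loop2))

    def state_at(prefix, loop, i):
        # i-th state of the infinite trace prefix.(loop)^omega
        return prefix[i] if i < len(prefix) else loop[(i - len(prefix)) % len(loop)]

    def pair_at(i):
        return (state_at(prefix1, loop1, i), state_at(prefix2, loop2, i))

    prefix = [pair_at(i) for i in range(n)]
    loop = [pair_at(i) for i in range(n, n + period)]
    return prefix, loop
```

Since the product loop covers whole periods of both component loops, every state visited in either fair loop, and hence every fairness condition satisfied there, also occurs in the product loop.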


Theorem 2. Let $P$ be a plant and let $P_{\sim} = \mathrm{RIBBON}(\mathrm{TWIN}(P, Y), p, q, F_1, F_2)$ be a
ribbon structure where $p = \beta_L$, $q = \neg\beta_R \wedge \mathrm{OBSEQ}$, $F_1 = \emptyset$, $F_2 = F_L \cup F_R$. There
exists a ribbon-shaped critical pair in $P$ for the diagnosability of $\beta$ iff $\varphi_{\sim}$ is
reachable in $P_{\sim}$.


Proof. Here, we sketch the proof of the theorem. The full proof is given in the
extended version of the paper. We separately prove both directions of the
equivalence:


$\Rightarrow$ We have $\pi_1, \pi_2 \in \Pi_P$ satisfying Definition 4. We prove that there is a trace
$\rho \in \Pi_{P_{\sim}}$ that reaches $\varphi_{\sim}$. At first, we show what the trace looks like and
then we prove it is a trace of $P_{\sim}$. Let $\pi_{1,F}, \pi_{2,F}$ be a critical pair with two
loops, where the first loop is equal to the loop in $\pi_1, \pi_2$ and the second loop
is fair. We construct the path $\rho$ as symbolized in Figure 3. The main idea is
to set $\rho_{|V}$ to $\pi_{1,F} \times \pi_{2,F}$. The existence of loop bounds $k, l, k', l'$ follows from
the definition of RCP. In the copy variables $\rho_{|\overline{V}}$, we keep $(\pi_{1,F} \times \pi_{2,F})[k]$
until the first loop is found and then we switch to $(\pi_{1,F} \times \pi_{2,F})[k']$. Flags
$m_{seen}$ and $m_{L_1}$ are set according to the bounds of the loops. Flags $m_p$
and $m_{2,i}$ are set to true after conditions $\beta_L$ and $f_i$ respectively were true.
The existence of such states where the conditions are satisfied follows from
the definition of RCP. Flag $m_q$ is true until the first loop is found, because
from the definition of RCP we know that $\neg\beta_R$ and OBSEQ are true.
The formal definition of $\rho$ and the full proof that $\rho \in \Pi_{P_{\sim}}$ and that $\rho$ reaches $\varphi_{\sim}$ are
given in the appendix.

$\Leftarrow$ We have that $\varphi_{\sim}$ is reachable in $P_{\sim}$, thus there is $\rho \in \Pi_{P_{\sim}}$ that reaches $\varphi_{\sim}$. Assume we
have such $\rho$. We show how to construct $\pi_1$ and $\pi_2$ from $\rho$ and then we prove
that $\pi_1, \pi_2$ are an RCP for $P$ and $\beta$. Let us set $\pi_1 = \rho_{|V_L}$ and $\pi_2 = \rho_{|V_R}$.
Let the bounds $k, l, k', l'$ of the loops in $\pi_1$ and $\pi_2$ be the indices:

- $l'$ is such that $\rho, l' \models \varphi_{\sim}$;

- $k' < l'$ is the greatest index such that $\rho, k' \models \neg m_{seen}$;

- $l < k' + 1$ is such that $\rho, l \models \neg m_{L_1} \wedge m_{L_1}'$; from the construction we know
there is only one such $l$;

- $k < l$ is the greatest index such that $\rho, k \models \neg m_{seen}$.

In Appendix A, we finish the proof by showing that $\pi_1$ and $\pi_2$ are an RCP.


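Theorem 3. Let $P$ be a plant and let $P_{\sim} = \mathrm{RIBBON}(\mathrm{TWIN}(P, Y), p, q, F_1, F_2)$ be a
ribbon structure where $p = \beta_L$, $q = \neg\beta_R \wedge \mathrm{OBSEQ}$, $F_1 = F_L$, $F_2 = F_L \cup F_R$.
There exists a ribbon-shaped fair critical pair in $P$ for the fair diagnosability of $\beta$ iff
$\varphi_{\sim}$ is reachable in $P_{\sim}$.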


The proof is very similar to the previous one. The only difference is the 
necessity to verify the fairness of the first loop, which is done the same way as 
the fairness of the second loop and thus straightforward. 


6 Experimental Evaluation 


We compared the proposed technique based on L2S with the technique based on
the computation of fixpoints using BDDs proposed in [16] and briefly described
in Section 4.3. We implemented both algorithms in the xSAP platform [4] and
tested them on benchmarks. The benchmarks, the tool and the scripts required
to test it can be found online¹. In this section, we first introduce the
implementation of the proposed technique and describe the benchmarks. Then, we
show a comparison of the two techniques and comment on their performance.


Fig. 3: The trace $\rho$ as constructed in the proof of Theorem 2. For each $m \in V_m$,
a dashed line means $\rho, i \models \neg m$, a full line and a full circle mean $\rho, i \models m$,
an empty circle means $\rho, i + 1 \models m$.


6.1 Implementation 


We have implemented both the L2S algorithm and the BDD-based algorithm
inside the xSAP tool [4]. The algorithms make use of various procedures
already implemented in nuXmv [8] and integrated in xSAP, mainly the computation
of fixpoints with BDDs [7] and different invariant model checking algorithms. The
fair states are computed with the Emerson-Lei doubly-nested fixpoint algorithm [1].
The invariant model checking is implemented using engines based on the standard
verification algorithms IC3 [6], k-induction [18] and BDD-based fixpoint [11].

The input of each algorithm is a model in the SMV language², a list of
observable variables of the model, a propositional diagnosis condition and an LTL
formula representing the context. Both the model and the context are translated
into Büchi automata and their parallel composition with the union of their
accepting states is computed. The resulting set of accepting states is the set of
fairness conditions. Then, a twin plant is constructed. The fixpoint-based
algorithm is described in Section 4.3.


1 http://es.fbk.eu/people/vvozarova/diag-rcp-search.zip
2 See the nuXmv manual (https://nuxmv.fbk.eu)


Table 1: Properties of the used models.


model      #bool var   #reach   diam   #obs   #fairness
acex       31          oe       96     5-21   1
autogen    99          912:0    20     4-20   1-4
cassini    176         gee      8      5-58   1
guidance   98          oft?     70     5-62   1
pdist      83          git      31     5-41   1-4


In the L2S algorithm, we get the ribbon structure $P_{\sim}$ and the propositional
formula $\varphi_{\sim}$ by extending the twin plant with new variables and transitions as
defined in Section 5. The parameters $p$ and $q$ are constructed from the diagnosis
condition and the set of observable variables. Finally, an arbitrary reachability
algorithm is used to solve the reachability of $\varphi_{\sim}$ in the resulting system.


6.2 Benchmarks 


We selected several benchmarks modelling industrial use cases. The models are
finite with Boolean variables. For each model, we have specified a fault condition
and possibly several sets of fairness conditions. Both the fault condition and the
fairness conditions are given as propositional formulas. In Table 1, we give for
each model the number of variables, the number of reachable states, the diameter
of the state space, the sizes of the sets of observable variables and the sizes of the
fairness condition sets.

Each benchmark was tested with several sets of observable variables and some
were tested with several sets of fairness conditions. In sum, we have 72 examples
for each of the diagnosability and fair diagnosability problems (144 instances), and
each instance was solved by the BDD-based fixpoint approach and by the L2S approach
with the IC3, k-induction and BDD engines. This gives a total of 576 individual
invocations of the xSAP tool. The experiments were run in parallel on a cluster with
nodes equipped with an 8-core Intel Xeon CPU running at 2.27GHz and 48GB of RAM.
The timeout for each run was two hours and the memory cap was set to 8GB.


6.3 Results 


The results for selected examples are given in Table 2, and all results are 
plotted in Figure 4a for diagnosability and in Figure 4b for fair 
diagnosability. We compare the BDD-based fixpoint algorithm (FP-BDD) with the 
L2S algorithm using the IC3 engine (L2S-IC3). The k-induction engine was unable 
to prove diagnosability within the given bound on k (150) and the time and 
memory limits. In general, it performs better on cases where a counterexample 
exists, which are not of concern in this paper. The runs for L2S with the BDD 
engine reached the timeout in 127 out of 144 cases, and for this reason we do 
not include it in the analysis. 

Fig. 4: Results for the diagnosability (a) and the fair diagnosability (b) comparing 
L2S approach with IC3 engine and BDD-based fixpoint computation approach. 
The axes represent time in seconds on a logarithmic scale. 


As both figures and the table show, the approach using L2S extension is in 
most cases more effective than the BDD-based approach proposed in the previous 
literature. The novel technique manages to outperform the previous one in most 
cases, as is shown by the cases plotted below the diagonal line on each figure. 
Moreover, it manages to solve some cases in which the fixpoint-based algorithm 
timed out. For the acex model, FP-BDD performs better than L2S-IC3. This 
is because the model has few boolean variables, thus BDDs are smaller and 
operations on them are faster. In addition, IC3 needs 56-116 frames to prove 
non-reachability on acex, compared to 3-62 frames in other cases. 


7 Conclusions and Future Work 


In this paper, we considered the problem of proving the absence of a ribbon- 
shaped path, which is a core issue in proving diagnosability of fair transition 
systems. We conceived a new encoding extending the liveness-to-safety paradigm 
in order to search for two consecutive loops. We implemented the algorithm in the 
xSAP tool and evaluated it on various diagnosability benchmarks in comparison 
with a fixpoint-based solution. 

The directions for future work are manifold: first, generalize the looping 
conditions to consider also problems different from diagnosability such as non- 
interference properties (as in [13]); second, exploit the generation of infinite sets 
of traces in counterexample-guided abstraction refinement, reducing the number 
of refinement iterations; finally, extend the approach to infinite-state systems, 
taking into account data variables that are updated in the loop (as in [9]). 
Table 2: Results comparing L2S with IC3 engine and BDD-based algorithm. The 
times are given in seconds, TO stands for the timeout of 7200 seconds. All cases 
are diagnosable. 


                          diagnosability      fair diagnosability
Model     |obs|  fair    L2S-IC3    FP-BDD    L2S-IC3    FP-BDD
acex          5     1         TO     29.59    3114.25     30.58
              9     1    4385.18     22.25    1493.26     23.22
             13     1     992.60     24.25    1203.87     21.94
             17     1    1450.14     24.65    1754.44     25.35
             21     1    1328.43     27.22    1996.66     30.39
autogen       4     1     676.89    657.07     179.27    809.45
                    2     300.14    968.09     994.24    840.69
                    3     415.00    741.83     756.72    988.33
                    4     228.62   5638.46     800.11   5457.23
             16     1    2231.98   5188.65     420.75   5318.42
                    2     379.31        TO     586.94        TO
                    3     411.57   6300.83     274.74   5459.99
                    4     771.76        TO     574.25        TO
             20     1     482.92   4741.88     522.16   5573.37
                    2     548.96   6016.29     943.53   6043.12
                    3     426.54   5728.60     945.79   5768.85
                    4    1134.01        TO     568.33        TO
cassini       5     1      31.85        TO      51.11        TO
             10     1      82.48        TO      60.35        TO
             15     1      90.62        TO      71.00        TO
             20     1      41.50        TO      61.95        TO
             25     1      62.39        TO      64.06        TO
             58     1      58.36        TO      64.65        TO
guidance      5     1     425.76    349.59     196.27    370.38
             10     1     173.75   1663.83     245.50   1727.08
             15     1     250.78   4616.18     128.19   4678.15
             20     1     271.89   2928.52     300.66   3598.55
             25     1     224.58        TO     507.58        TO
             62     1     278.82        TO      95.00        TO
pdist         5     1      95.19    458.85      96.81     350.6
                    2      46.33    511.33      46.72    435.57
                    3      48.09    424.51      40.19    419.86
                    4      80.44    420.94      29.07    388.84
             20     1      36.72     32.96    1635.52     35.92
                    2      22.47     28.86      35.54     34.19
                    3      71.29     34.42      34.19     35.74
                    4      33.86     33.65      28.98     31.97
             25     1     773.29    246.48     285.85    280.56
                    2      54.20    215.76      38.83    279.42
                    3      35.06    216.55      25.86    219.13
                    4      24.75    217.05      16.75    217.25
             41     1      23.82    818.28      38.74    859.25
                    2      31.33    643.03      50.58    759.50
                    3      41.38    633.24      14.40    782.73
                    4      22.21    750.93      18.93    818.21
Abstract. There is no silver bullet for software verification: Different 
techniques have different strengths. Thus, it is imperative to combine 
the strengths of verification tools via combinations and cooperation. 
CoVeriTeam is a language and tool for on-demand composition of cooper- 
ative approaches. It provides a systematic and modular way to combine 
existing tools (without changing them) in order to leverage their full 
potential. The idea of cooperative verification is that different tools help 
each other to achieve the goal of correctly solving verification tasks. 
The language is based on verification artifacts (programs, specifications, 
witnesses) as basic objects and verification actors (verifiers, validators, 
testers) as basic operations. We define composition operators that make it 
possible to easily describe new compositions. Verification artifacts are the 
interface between the different verification actors. CoVeriTeam consists 
of a language for composition of verification actors, and its interpreter. 
As a result of viewing tools as components, we can now create powerful 
verification engines that are beyond the possibilities of single tools, without 
having to develop certain components repeatedly. We illustrate the abilities 
of CoVeriTeam on a few case studies. We expect that CoVeriTeam will 
help verification researchers and practitioners to easily experiment with 
new tools, and assist them in rapid prototyping of tool combinations. 


Keywords: Cooperative Verification · Tool Development · Software Verification 
· Automatic Verification · Verification Tools · Tool Composition · Tool Reuse 


1 Introduction 


As research in the field of formal verification advanced, the complexity of the 
programs under verification also kept increasing. As a result, despite their 
successful application to the source code of large industrial and open-source 
projects [2, 3, 23, 27, 36], the current techniques fall short of solving many 
important verification tasks. It seems essential to combine the strengths of 
different verification techniques and tools to solve these tasks. 

The verification community successfully applies different approaches to com- 
bine ideas: integrated approaches (source-code-based), where different pieces 
of source code are integrated into one tool [28], and off-the-shelf approaches 
(executable-based), where different executables from existing tools are combined 
without changing them. The latter can be further classified into sequential and 
parallel portfolio [33], algorithm selection [37], and cooperative approaches [22]. 


The integrated approaches require development effort for adaptation or im- 
plementation of integrated components instead of building on existing mature 
implementations—the combination is very tight. On the other hand, the standard 
off-the-shelf approaches (portfolio [33] and selection [37]) let the tools run in 
isolation and the individual tools do not cooperate at all. The components do not 
benefit from the knowledge that is produced by other tools in the combination— 
the combination is very loose. In this work, we focus on cooperative verification, 
which is neither as tight as source-code integration nor as loose as portfolio 
and selection approaches—somewhere in between the two extremes. 

Cooperative verification [22] is an approach to combine different tools for 
verification in such a way that they help each other solve a verification task, 
where the combinations are neither too tight nor too loose. Implementations 
include using a shared database to exchange information (e.g., there are 
cooperative SAT solvers that use a shared set of learned clauses [34], and 
cooperative software verifiers that use a shared set of reached abstract 
states [14]) or passing information from one tool to the other (e.g., 
conditional model checkers [13, 25]). Cooperative verification aims to combine 
the individual strengths of these technologies to achieve better results. Our 
thesis is that programming (meta) verification systems based on combination and 
cooperation could be a promising solution. CoVeriTeam provides a framework to 
achieve this. 

Developing such a tool is not straightforward. Various concerns that need 
to be addressed for developing a robust solution can be broadly divided into 
two categories: concepts and execution. (1) Concepts deal with defining the 
interfaces for tools, and with the mechanism for their combination. Before tools 
can cooperate, we need a common definition of tools based on their behavior. 
We need to categorize what a tool does, what inputs it consumes, and what 
outputs it produces, before we can use it in a cooperative framework with ease. 
In CoVeriTeam, we categorize tools into various types of verification actors, 
and the inputs and outputs of these actors into verification artifacts. The 
actors can be combined with the help of composition operators that define the 
mechanism of cooperation. (2) Execution is concerned with all issues during 
the execution of a tool. Actors first need to execute to cooperate. This opens 
another dimension of challenges and opportunities to improve the cooperation. 
To give two examples: a tool might consume too many resources, so resources 
must be controlled and limited, and tools might interfere with other executing 
processes, so tools must be executed in isolated containers. 


This paper presents CoVeriTeam, a language and tool for on-demand composition 
of cooperative verification systems that solves the above-mentioned 
challenges. We contribute a domain-specific language and an execution engine. In 
the CoVeriTeam language, we can compose new actors based on existing ones 
using built-in composition operators. The existing components are not changed, 
but taken off-the-shelf from actor providers (technically: tool archives): the 
composition is done on-demand (when needed by the user) and on-the-fly (it does 
not compile a new tool from the components). In other words, existing 
verification tools are viewed as off-the-shelf components, and can be used in a 
larger context to construct more powerful verification compositions. Our 
approach does not require writing code in the programming languages used to 
develop the underlying components. In the CoVeriTeam language, the user can 
execute tools without fearing that they interact with the host system or with 
other tools in an unspecified way. The execution environment, as well as input 
and output, are controlled using the Linux features cgroups, namespaces, and 
overlay file systems. We use the BenchExec [20] system as a library for 
convenient access to those OS features via a Python API. 


Contributions. We make the following contributions: 


1. a language to compose new verification tools based on existing ones, 

2. an execution engine for on-the-fly execution of these compositions, 

3. case studies implementing combinations in CoVeriTeam that were previously 
achieved only via hard-wired combinations, and 

4. an open-source implementation and an artifact for reproduction. 


In addition to the above-mentioned contributions, CoVeriTeam provides the 
following features to the end user: (1) CoVeriTeam takes care of downloading 
and installing specified verification tools on the host system. (2) There is no need 
to learn command-line parameters of a verification tool because CoVeriTeam 
takes care of translating the input to the arguments for the underlying tool. This 
provides a uniform interface to a number of similar tools. (3) The off-the-shelf 
components (i.e., tools) are executed in a container, with resource limits, such 
that the execution cannot change (or even damage) the host system. 

These features in turn enable a researcher or practitioner to easily 
experiment with new tools, and rapidly prototype new verification combinations. 
CoVeriTeam liberates the researcher who uses tool combinations from 
maintaining scripts that combine tool executions, and from worrying about 
downloading, installing, and figuring out the command to execute a verification 
tool. 


Impact. CoVeriTeam has already found use cases in the verification community: 
(1) It was used in a modular implementation of CEGAR [26] using off-the-shelf 
components [12]. (2) It was used for construction and evaluation of various 
verifier combinations [17]. (3) CoVeriTeam (wrapped in a service) was used in 
the software-verification competitions 2021 and 2022 to help the participants 
debug issues with their tools (see Sect. 3 in [7]). Also, according to SV-COMP 
rules, a team is granted points only for those tasks whose result can be 
validated using a validator. Thus, a verifier-validator combination might be 
interesting for participants. With the help of CoVeriTeam, such combinations 
can be easily constructed. 

Also, the advent of many high-quality verifiers should lead to a certain 
level of standardization of the API and provided features. For example, tools 
for SMT or SAT solving are easy to use because of their standardized input 
language (e.g., SMT-LIB for SMT solvers [4]). Consequently, such tools can be 
easily integrated into larger architectures as components. Our vision is that soon 
verifiers will be seen also as components that can be used in larger architectures 
just like SMT solvers are now integrated into verification tools. 
Example 1 Witness Validation 
Input: Program p, Specification s 
Output: Verdict 
verifier := Verifier("Ultimate Automizer") 
validator := Validator("CPAchecker") 
result := verifier.verify(p, s) 
if result.verdict ∈ {TRUE, FALSE} then 
    result := validator.validate(p, s, result.witness) 
return (result.verdict, result.witness) 



2 Running Example 


We explain the idea of CoVeriTeam using a short example. Verifiers are complex 
software systems and might have bugs. Therefore, for more assurance, a user 
might want to validate the result of a verifier based on the verification witness 
that the verifier produces [10]. Such a procedure is sketched in Example 1. 
The user wanting to execute the procedure sketched in Example 1 would 
need to download the tools (verifier and validator), execute the verifier, check 
the result of the verifier, and then, if needed, connect the outputs of the verifier 
with the inputs of the validator. The user would quite possibly write a shell 
script to do this, which is cumbersome and difficult to maintain. 
CoVeriTeam takes care of all the above issues. In the next section, we discuss 
the types, namely artifacts and actors, that are used in the CoVeriTeam language. 
After this, we explain the design and usage of the CoVeriTeam execution engine, 
and discuss the CoVeriTeam program for our validating verifier in Listing 1. 


3 Design and Implementation of CoVeriTeam 


We now explain details about the design and implementation of CoVeriTeam. 
First, we discuss conceptual notions of actors, artifacts, and compositions; then 
we discuss execution concerns that a cooperative verification tool needs to 
handle. Then we delve deeper into implementation details, where we discuss 
how an actor is created and executed. Last, we briefly explain the API that 
CoVeriTeam exposes and the extensibility of this API. 


3.1 Concepts 


This section describes the language that we have designed for cooperative verifica- 
tion and on-demand composition. At first we describe the notion of artifacts and 
actors, and then the composition language to compose components to new actors. 


Artifacts and Actors. Verification artifacts provide the means of information 
(and knowledge) exchange between the verification actors (tools). Figure 1 shows 
a hierarchy of artifacts, restricted to those that we have used in the case 
studies for evaluating our work. On a high level, we divide verification 
artifacts into the following kinds: Programs, Specifications, Verdicts, and 
Justifications. Programs are behavior models (can be further classified into 
programs, control-flow graphs, timed automata, etc.). Specifications include 
behavioral specifications (for formal verification) and test specifications 
(coverage criteria for test-case generation). Verdicts are produced by actors, 
signifying the class of result obtained (TRUE, FALSE, UNKNOWN, TIMEOUT, ERROR). 
Justifications for the verdict are produced by an actor; they include test 
suites to justify an obtained coverage, or verification witnesses to justify a 
verification result. 


Fig. 1: Hierarchy of Artifacts (arrows indicate an is-a relation). Nodes shown, 
by level: Artifact; Justification, Verdict, Program, Specification; TestSuite, 
Condition, Witness, BehaviorSpec, TestSpec; CoveredGoals, CoveredSpace, 
Termination, Safety, Overflow, TestGoal, CoverageCriterion. 


Fig. 2: Hierarchy of Actors (arrows indicate an is-a relation). Nodes shown, 
by level: Actor; Analyzer, Transformer; Verifier, Tester, Validator, 
TestGoalExtractor, Reducer, Instrumentor, WitnessToTest; ConditionalVerifier, 
ConditionalTester, WitnessValidator, TestValidator, Pruner, Annotator, 
WitnessIns, TestSpecIns. 

Verification actors act on the artifacts and as a result either produce new 
artifacts or transform a given artifact for consumption by some other actor. 
Figure 2 shows a hierarchy of actors, restricted to those that we have used in 
the case studies for evaluating our work. We divide verification actors into 
the following types: Analyzers and Transformers. Analyzers create new 
knowledge, e.g., verifiers, validators, and test generators. Transformers 
instrument, refine, or abstract artifacts. 


Composition. Actors can be composed to create new actors. Our language 
supports the following compositions: sequence, parallel, if-then-else, and repeat. 

CoVeriTeam infers types and type-checks the compositions, and then either 
constructs a new actor or throws a type error. In the following, we use the 
notation $I_a$ for the input parameter set of an actor $a$ and $O_a$ for the 
output parameter set of $a$. A parameter is a pair of name and artifact type. 
A name clash between two sets $A$ and $B$ exists if there is a name in $A$ that 
is mapped to a different artifact type in $B$, more formally: 
$\exists (a, t_1) \in A, (a, t_2) \in B : t_1 \neq t_2$. The actor type is a 
mapping from input parameter set to output parameter set ($I_a \to O_a$). 
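
The name-clash condition can be made concrete with a small sketch. The 
following Python fragment is only an illustration and not CoVeriTeam code: it 
assumes that a parameter set is modeled as a dictionary from parameter names to 
artifact-type names, and the identifiers `ParamSet` and `name_clash` are 
hypothetical. 

```python
from typing import Dict

# Assumption of this sketch: a parameter set maps a parameter name to the
# name of its artifact type (plain strings stand in for real types).
ParamSet = Dict[str, str]


def name_clash(a: ParamSet, b: ParamSet) -> bool:
    """Return True if some name occurs in both sets with different types."""
    return any(name in b and b[name] != typ for name, typ in a.items())


verifier_in = {"program": "Program", "spec": "BehaviorSpec"}
validator_in = {"program": "Program", "spec": "BehaviorSpec", "witness": "Witness"}
tester_in = {"program": "Program", "spec": "TestSpec"}

print(name_clash(verifier_in, validator_in))  # shared names agree on the type
print(name_clash(verifier_in, tester_in))     # 'spec' has two different types
```

Sharing a name is therefore harmless as long as both sets agree on its type; 
only a disagreement counts as a clash. 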


Fig. 3: Compositions in CoVeriTeam: (a) SEQUENCE, (b) PARALLEL, (c) ITE, 
(d) REPEAT 


Sequential. Given two actors a1 and a2, the sequential composition SEQUENCE 
(a1, a2) (Fig. 3a) constructs an actor that executes a1 and a2 in sequence, 
i.e., one after another. The composition is well-typed if there is no name clash 
between $I_{a1}$ and $(I_{a2} \setminus O_{a1})$. This means that we allow the 
same artifact to be passed to the second actor in sequence, but disallow the 
confusing scenario where both actors expect an artifact with the same name but 
different type. The inferred type of the composition is 
$I_{a1} \cup (I_{a2} \setminus O_{a1}) \to O_{a2}$. 
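
As an illustration of this typing rule, the following Python sketch infers the 
type of a sequential composition. It is not CoVeriTeam code: parameter sets are 
assumed to be dictionaries from names to artifact-type names, an actor type is 
assumed to be a pair (inputs, outputs), and `sequence_type` is a hypothetical 
helper. 

```python
from typing import Dict, Tuple

ParamSet = Dict[str, str]               # name -> artifact-type name (assumption)
ActorType = Tuple[ParamSet, ParamSet]   # (input parameters, output parameters)


def name_clash(a: ParamSet, b: ParamSet) -> bool:
    return any(name in b and b[name] != typ for name, typ in a.items())


def sequence_type(a1: ActorType, a2: ActorType) -> ActorType:
    (i1, o1), (i2, o2) = a1, a2
    # I_a2 \ O_a1: inputs of a2 that a1 does not produce (same name and type)
    remaining = {n: t for n, t in i2.items() if o1.get(n) != t}
    if name_clash(i1, remaining):
        raise TypeError("name clash between inputs of a1 and remaining inputs of a2")
    return ({**i1, **remaining}, dict(o2))


verifier = ({"program": "Program", "spec": "BehaviorSpec"},
            {"verdict": "Verdict", "witness": "Witness"})
validator = ({"program": "Program", "spec": "BehaviorSpec", "witness": "Witness"},
             {"verdict": "Verdict", "witness": "Witness"})

inputs, outputs = sequence_type(verifier, validator)
print(sorted(inputs), sorted(outputs))
```

In this example the witness needed by the validator is produced by the 
verifier, so it does not appear among the inputs of the composition. 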


Parallel. Given two actors a1 and a2, the parallel composition PARALLEL (a1, 
a2) (Fig. 3b) constructs an actor that executes the actors a1 and a2 in 
parallel. The composition is well-typed if (a) there is no name clash between 
$I_{a1}$ and $I_{a2}$ and (b) the names of $O_{a1}$ and $O_{a2}$ are disjoint. 
The inferred type of the composition is 
$I_{a1} \cup I_{a2} \to O_{a1} \cup O_{a2}$. 
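
The two conditions for the parallel composition can be sketched in the same 
style. Again, this is an illustrative Python fragment under the assumption that 
parameter sets are dictionaries from names to artifact-type names; 
`parallel_type` and the parameter names `suite_a` and `suite_b` are 
hypothetical. 

```python
from typing import Dict, Tuple

ParamSet = Dict[str, str]               # name -> artifact-type name (assumption)
ActorType = Tuple[ParamSet, ParamSet]   # (input parameters, output parameters)


def name_clash(a: ParamSet, b: ParamSet) -> bool:
    return any(name in b and b[name] != typ for name, typ in a.items())


def parallel_type(a1: ActorType, a2: ActorType) -> ActorType:
    (i1, o1), (i2, o2) = a1, a2
    if name_clash(i1, i2):                      # condition (a)
        raise TypeError("name clash between the inputs of a1 and a2")
    if set(o1) & set(o2):                       # condition (b)
        raise TypeError("output names of a1 and a2 must be disjoint")
    return ({**i1, **i2}, {**o1, **o2})


tester_a = ({"program": "Program", "spec": "TestSpec"}, {"suite_a": "TestSuite"})
tester_b = ({"program": "Program", "spec": "TestSpec"}, {"suite_b": "TestSuite"})

inputs, outputs = parallel_type(tester_a, tester_b)
print(sorted(inputs), sorted(outputs))
```

Composing an actor with a copy of itself would violate condition (b), because 
both copies would write to the same output name. 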


ITE. Given a predicate cond and two actors a1 and a2, the if-then-else 
composition ITE (cond, a1, a2) (Fig. 3c) constructs an actor that executes the 
actor a1 if predicate cond evaluates to true, and the actor a2 otherwise. The 
composition is well-typed if (a) there is no name clash between cond, $I_{a1}$, 
and $I_{a2}$, and (b) the output parameters are the same ($O_{a1} = O_{a2}$). 
The inferred type of the composition is 
$I_{a1} \cup I_{a2} \cup vars(cond) \to O_{a1}$, where vars maps the variables 
used in a predicate to their artifact types. This allows us to define the 
condition cond using artifacts other than the inputs in $I_{a1}$ and $I_{a2}$. 

There are situations where a2 is not required and its explicit specification only 
increases complexity. So, we have relaxed the type checker and made a2 optional. 
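
A possible reading of the ITE rule, including the optional second actor, is 
sketched below in Python. This is an assumption-laden illustration and not 
CoVeriTeam code: parameter sets are dictionaries from names to artifact-type 
names, `ite_type` is hypothetical, and the treatment of a missing a2 (it 
contributes no inputs and imposes no output constraint) is our assumption, 
since the text above does not spell out the relaxed rule. 

```python
from typing import Dict, Optional, Tuple

ParamSet = Dict[str, str]               # name -> artifact-type name (assumption)
ActorType = Tuple[ParamSet, ParamSet]   # (input parameters, output parameters)


def name_clash(a: ParamSet, b: ParamSet) -> bool:
    return any(name in b and b[name] != typ for name, typ in a.items())


def ite_type(cond_vars: ParamSet, a1: ActorType,
             a2: Optional[ActorType] = None) -> ActorType:
    i1, o1 = a1
    # Assumption: an omitted a2 has no inputs and the same outputs as a1.
    i2, o2 = a2 if a2 is not None else ({}, o1)
    for x, y in ((cond_vars, i1), (cond_vars, i2), (i1, i2)):
        if name_clash(x, y):                    # condition (a)
            raise TypeError("name clash between cond, inputs of a1, inputs of a2")
    if o1 != o2:                                # condition (b)
        raise TypeError("both branches must have the same output parameters")
    return ({**i1, **i2, **cond_vars}, dict(o1))


validator = ({"program": "Program", "spec": "BehaviorSpec", "witness": "Witness"},
             {"verdict": "Verdict", "witness": "Witness"})

inputs, outputs = ite_type({"verdict": "Verdict"}, validator)
print(sorted(inputs), sorted(outputs))
```

The variable used by the condition (here the verdict) becomes an input of the 
composition even though the validator itself does not consume it. 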


Repeat. Given a set fp and an actor a, the repeat composition REPEAT (fp, a) 
(Fig. 3d) constructs an actor that repeatedly executes actor a until a fixed 
point of set fp is reached, that is, fp did not change in the last execution 
of a. The repeat composition feeds back the output of a from iteration n to a 
for iteration n + 1. Let us partition $I_a \cup O_a$ into three sets: 
$I_a \setminus O_a$, $O_a \setminus I_a$, and $I_a \cap O_a$. The parameters in 
$I_a \setminus O_a$ do not change their value and the parameters in 
$O_a \setminus I_a$ are accumulated if accumulatable, otherwise their value 
after the execution of the composition is the value from the last iteration. The 
composition is well-typed if $fp \subseteq dom(I_a \cap O_a)$, where dom returns 
the names of a parameter set. The inferred type of the composition is 
$I_a \to O_a$. 
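
The feedback loop of the repeat composition can be sketched as follows. The 
Python fragment is a simplified illustration, not CoVeriTeam code: actors are 
assumed to be functions from a dictionary of artifacts to a dictionary of 
artifacts, accumulation of outputs is omitted, and `repeat`, `toy_tester`, and 
the guard `max_iterations` are hypothetical. 

```python
def repeat(fp, actor, inputs, max_iterations=100):
    """Run `actor` until the parameters named in `fp` stop changing."""
    current = dict(inputs)
    outputs = {}
    for _ in range(max_iterations):      # guard against non-termination
        outputs = actor(current)
        unchanged = all(outputs[name] == current[name] for name in fp)
        # Feed back: parameters that are both input and output are updated.
        for name, value in outputs.items():
            if name in current:
                current[name] = value
        if unchanged:
            break
    return outputs


def toy_tester(artifacts):
    """Toy actor: covers one additional test goal per execution."""
    goals, covered = artifacts["goals"], artifacts["covered"]
    newly_covered = frozenset(sorted(goals - covered)[:1])
    return {"covered": covered | newly_covered}


result = repeat({"covered"}, toy_tester,
                {"goals": frozenset({"g1", "g2", "g3"}), "covered": frozenset()})
print(sorted(result["covered"]))
```

Here `goals` lies in the input-only part and stays constant, while `covered` 
lies in the intersection of inputs and outputs and is fed back until it no 
longer changes. 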
Fig. 4: CoVeriTeam implementation of the validating verifier from Example 1 


Figure 4 shows the pictorial representation of our running example using 
these compositions. First a verifier is executed, then the validator is executed if 
the verifier returned TRUE or FALSE, otherwise (in case of UNKNOWN, TIMEOUT, 
ERROR) the validator is not executed and the output of the verifier is forwarded. 
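
The behavior of this composition can be mimicked with a few higher-order 
functions. The following Python sketch is purely illustrative: the functions 
`sequence` and `ite`, the `forward` parameter, and the stub actors are 
hypothetical stand-ins and do not reflect the CoVeriTeam API, and the stubs 
return canned results instead of running real tools. 

```python
calls = []  # records which stub actors were executed


def stub_verifier(artifacts):
    calls.append("verifier")
    verdict = "TRUE" if artifacts["program"] == "safe.c" else "UNKNOWN"
    return {"verdict": verdict, "witness": "witness.graphml"}


def stub_validator(artifacts):
    calls.append("validator")
    # A real validator would re-check the witness; this stub confirms it.
    return {"verdict": artifacts["verdict"], "witness": artifacts["witness"]}


def sequence(a1, a2):
    def run(artifacts):
        produced = a1(artifacts)
        return a2({**artifacts, **produced})   # a2 sees inputs and a1's outputs
    return run


def ite(cond, a1, a2=None, forward=()):
    def run(artifacts):
        if cond(artifacts):
            return a1(artifacts)
        if a2 is not None:
            return a2(artifacts)
        # No else-branch: forward the named artifacts unchanged.
        return {name: artifacts[name] for name in forward}
    return run


def verdict_is_conclusive(artifacts):
    return artifacts["verdict"] in {"TRUE", "FALSE"}


second_component = ite(verdict_is_conclusive, stub_validator,
                       forward=("verdict", "witness"))
validating_verifier = sequence(stub_verifier, second_component)

print(validating_verifier({"program": "safe.c", "spec": "unreach-call.prp"}))
print(validating_verifier({"program": "hard.c", "spec": "unreach-call.prp"}))
```

For the first input both stubs run; for the second, the verifier answers 
UNKNOWN and its output is forwarded without calling the validator. 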


3.2 Execution Concerns 


A tool for cooperative verification orchestrates the execution of verification 
tools. This means it needs to assemble the command for the tool, as well as 
handle the output produced by the tool. A verification tool might consume a lot 
of resources, and a user might want to limit this. It might also crash during 
execution or interfere with other processes. CoVeriTeam needs to handle all 
these concerns. 

Instead of developing our own infrastructure to handle these concerns, we 
reuse some of the features provided by BenchExec [20]: we use tool-info modules 
to assemble command lines and parse log output, and RunExec (a component of 
BenchExec) to execute tools in a container and limit resource consumption. 

Tool-info modules are integration modules of the benchmarking framework 
BenchExec [20]. A typical tool-info module is a few lines of code used for 
assembling a command line and parsing the log output produced by the tool. It 
takes only a few hours to create one.¹ CoVeriTeam uses tool-info modules to 
pass artifacts to atomic actors (assemble the command line) and to extract 
artifacts from the output produced by the atomic actor. Using tool-info modules 
gave us integration of more than 80 tools without effort, because such 
integration modules exist for most well-known verifiers, validators, and 
testers (as many researchers use BenchExec and provide such integration modules 
for their tools). 

CoVeriTeam uses RunExec to isolate tool execution, to prevent interference 
with the execution environment, and to enforce resource limits. We also report 
back to the user the resources consumed by the tool execution as measured by 
RunExec. 
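
To give an impression of the two tasks of an integration module (assembling a 
command line and parsing log output), the following Python sketch shows a toy 
module. It only mimics the role of a tool-info module and does not use the 
BenchExec API: the class `ToyToolInfo`, its method signatures, the `--spec` 
flag, and the `RESULT:` log format are all assumptions made for illustration. 

```python
class ToyToolInfo:
    """Hypothetical integration module for an imaginary verifier."""

    def cmdline(self, executable, options, program, spec):
        # Assemble the command line from the tool options and the artifacts.
        return [executable, *options, "--spec", spec, program]

    def determine_result(self, output_lines):
        # Parse the log output and map it to a verdict.
        for line in output_lines:
            if line.startswith("RESULT:"):
                verdict = line.split(":", 1)[1].strip().upper()
                if verdict in {"TRUE", "FALSE"}:
                    return verdict
        return "UNKNOWN"


tool = ToyToolInfo()
command = tool.cmdline("./toy-verifier", ["--architecture", "32bit"],
                       "program.c", "unreach-call.prp")
print(command)
print(tool.determine_result(["parsing input", "RESULT: false"]))
print(tool.determine_result(["out of memory"]))
```

A framework that knows only these two operations can drive many different 
tools in a uniform way, which is the property that the reuse described above 
relies on. 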


3.3 CoVeriTeam 


Figure 5 provides an abstract view of the system. CoVeriTeam takes as input 
a program written in the CoVeriTeam language and artifacts. At first, the 
code generator converts this input program to Python code. This transformed 
code uses the internal API of CoVeriTeam. Then this Python code is executed, 
which means the actor executor is called on the specified actor. This in turn 
produces output artifacts on successful execution of the actor. 


¹ We claim this based on our experience with tool developers creating their tool-info 
modules, which is a prerequisite for participating in SV-COMP or Test-Comp. 


Fig. 5: Abstract view of the CoVeriTeam tool 


 1  verifier = ActorFactory.create(ProgramVerifier, "actors/uautomizer.yml");
 2  validator = ActorFactory.create(ProgramValidator, "actors/cpa-validate-violation-witnesses.yml");
 3
 4  // Use validator if verdict is true or false
 5  condition = ELEMENTOF(verdict, {TRUE, FALSE});
 6  second_component = ITE(condition, validator);
 7  // Verifier and second component to be executed in sequence
 8  validating_verifier = SEQUENCE(verifier, second_component);
 9
10  // Prepare example inputs
11  prog = ArtifactFactory.create(CProgram, prog_path);
12  spec = ArtifactFactory.create(BehaviorSpecification, spec_path);
13  inputs = {'program':prog, 'spec':spec};
14  // Execute the new component on the inputs
15  res = execute(validating_verifier, inputs);
16  print(res);


Listing 1: CoVeriTeam implementation of the validating verifier from Example 1 


There are four key parts of executing a CoVeriTeam program: creation of 
atomic actors, composition of actors (atomic or composite), creation of 
artifacts, and execution of the actors. We now give a brief explanation of 
these parts with the help of our running example. Listing 1 shows a CoVeriTeam 
implementation of the running example (Example 1). 


Creation of an Atomic Actor. Atomic actors in CoVeriTeam provide an 
interface for external tools. CoVeriTeam uses the information provided in an 
actor-definition file to construct an atomic actor. Lines 1 and 2 in Listing 1 
show the creation of atomic actors verifier and validator using the 
ActorFactory by providing the ActorType and the actor-definition file. Once 
constructed, this actor can be executed. 

An actor definition is specified in a file in YAML format. It contains the 
information necessary for executing the actor: location from where to download 
the tool, the name of the tool-info module to assemble the command line and parse 
tool output, additional command-line parameters for the tool, resource limitations 
to enforce, etc. Listing 2 shows the actor definition file for UAutomizer [32]: the 
actor name is uautomizer, the identifier for the BenchExec tool-info module is 
ultimateautomizer, the DOI of the tool archive (or the URL for obtaining the 
tool archive), the SPDX license identifier, the options passed by CoVeriTeam to 
UAutomizer, and resource limits for the execution of the actor. Once an atomic 
actor has been constructed using an actor definition, CoVeriTeam has all the 
information necessary to execute the underlying tool with the provided artifacts. 


actor_name: uautomizer 
toolinfo_module: "ultimateautomizer.py" 
archive: 
  doi: "10.5281/zenodo.3813788" 
  spdx_license_identifier: "LGPL-3.0-or-later" 
options: ['--full-output', '--architecture', '32bit'] 
resourcelimits: 
  memlimit: "15 GB" 
  timelimit: "15 min" 
  cpuCores: "8" 
format_version: '1.1' 


Listing 2: Definition of atomic actor in YAML format 


Composition of an Actor. The second key part is the composition of an actor. 
Lines 6 and 8 in Listing 1 create composite actors using ITE and SEQUENCE, 
respectively. It is these compositions that create the validating verifier of our 
running example. Verification actors in CoVeriTeam can exchange information 
(artifacts) with other actors and cooperate through compositions. 


Creation of an Artifact. An artifact in CoVeriTeam is a file wrapped in an 
artifact type. The underlying files are the basis of an artifact: they carry 
the exchangeable information. Lines 11 and 12 in Listing 1 create artifacts 
using the ArtifactFactory by providing the ArtifactType and the artifact file. 
These artifacts are then provided to the executor, which executes the actors on 
them. 


Code Generation. The code generator of CoVeriTeam translates the input 
program to Python code that uses the internal API of CoVeriTeam. It is a canonical 
transformation in which the statements for creation of actors and artifacts are 
converted to Python statements instantiating corresponding classes from the API. 
Similarly, statements for composition and execution of actors are also transformed. 


Execution. Analogously to the construction of actors, the execution of an actor 
in CoVeriTeam is also divided into two cases: atomic and composite. Line 15 in 
Listing 1 executes the actor validating_verifier on the given input artifacts. 
Figure 6 shows the actor executor for both atomic and composite actors. It 
executes an actor on the provided artifacts. At first, it type-checks the 
inputs, i.e., it checks whether the input types provided to the actor comply 
with the expected input types of the actor. It then calls the executor for an 
atomic or a composite actor, depending on the actor type. Thereafter, it 
type-checks the outputs, and at last returns the artifacts. 
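
The steps of the actor executor (check inputs, dispatch on the kind of actor, 
check outputs) can be sketched as follows. The Python fragment is a simplified 
illustration and not the CoVeriTeam implementation: the classes `Actor`, 
`Program`, and `Verdict` and the function `execute` are hypothetical, and the 
only composite shown is a plain sequence of children. 

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class Program(str):
    """Toy artifact type: a path to a program."""


class Verdict(str):
    """Toy artifact type: the verdict of an analysis."""


@dataclass
class Actor:
    input_types: Dict[str, type]
    output_types: Dict[str, type]
    run: Optional[Callable[[dict], dict]] = None           # atomic actors
    children: List["Actor"] = field(default_factory=list)  # composite actors


def check_types(artifacts, expected, what):
    for name, typ in expected.items():
        if not isinstance(artifacts.get(name), typ):
            raise TypeError(f"{what} '{name}' is missing or not a {typ.__name__}")


def execute(actor, artifacts):
    check_types(artifacts, actor.input_types, "input")
    if actor.children:
        # Composite actor: each child sees the outputs of earlier children.
        available, outputs = dict(artifacts), {}
        for child in actor.children:
            outputs = execute(child, available)
            available.update(outputs)
    else:
        outputs = actor.run(artifacts)
    check_types(outputs, actor.output_types, "output")
    return outputs


verifier = Actor({"program": Program}, {"verdict": Verdict},
                 run=lambda artifacts: {"verdict": Verdict("TRUE")})
composite = Actor({"program": Program}, {"verdict": Verdict}, children=[verifier])

result = execute(composite, {"program": Program("program.c")})
print(result)
```

Passing a plain string instead of a `Program` would be rejected by the input 
check before any actor is run. 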
Execution of an atomic actor means the execution of the underlying tool 
on the provided artifacts. At first, the executor downloads the tool if necessary. 
CoVeriTeam downloads and unzips the archive that contains the tool on the 
first execution of an atomic actor. It keeps the tool available in cache for later 
executions. After this step, the command line for the tool is prepared using the 
tool-info module. It then executes the tool in a container, and then processes the 
tool output, i.e., extracts the artifacts from the tool output and saves them. 


Fig. 6: Abstract view of an actor execution in CoVeriTeam 

Execution of a composition means executing the composed actors, making 
information produced by one available to the others during the execution, as per 
the rules of composition. The composite-actor executor at first selects the next 
child actor to execute. It then computes the inputs for this selected actor. Then 
it executes this actor, which can be atomic or another composite actor, on 
these inputs. It then processes the outputs produced by the execution of the 
selected child actor. This processing could be temporarily saving, filtering, or 
modifying the produced artifacts. If needed, it then proceeds to execute the 
next child actor, otherwise exits the composition execution. 


Output. CoVeriTeam collects all the artifacts produced during the execution of 
an actor, and saves them. The output can be divided into three parts: execution 
trace, artifacts, and log files. An execution trace is an XML file containing 
information about the artifacts consumed and produced by each actor, and also the 
resources consumed by atomic actors (as measured by BenchExec) during the 
execution. CoVeriTeam also saves the artifacts produced during the execution of an 
actor. Additionally, for each atomic actor execution, it also saves a log file 
containing the command that was actually executed and the messages printed on stdout. 


3.4 API 


In addition to the above-described features, CoVeriTeam exposes an API that is 
extensible. We expose actors, artifacts, utility actors, and compositions through 
Python packages. In this section, we briefly discuss this API. 


Library of Actors and Compositions. CoVeriTeam provides a library of 
some actors and a few compositions that can be instantiated with suitable 
actors. We considered actors based on the tools participating in the 
competitions on software verification and testing [5, 6] (available in the 
replication archives), because those are known to be mature and stable. 

The library of compositions contains a validating verifier, an execution-based 
validator [11], a reducer-based construction of a conditional model checker [15], 
CondTest [18], and MetaVal [21]. These are present in the examples/ directory 
of the CoVeriTeam repository. We discuss some of these constructions in Sect. 4.1. 


New Actors, Artifacts, and Tools. New actors, artifacts, and tools can be
integrated easily in CoVeriTeam. The integration of a new atomic actor requires
only creating a YAML actor definition and, if not already available, implementing
a tool-info module. The integration of a new actor type in the language requires
(1) creating a class for the actor specifying its input and output artifact types,
(2) preparing the parameters to be passed to the tool-info module, which in turn
creates a command line for the tool execution, using the options from
the YAML actor definition, and (3) creating output artifacts from the output
files produced by the execution of an atomic actor of that type.
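
The three steps for a new actor type can be sketched in Python as follows. This is only an illustrative sketch under assumptions: the class names, the method names prepare_command and collect_outputs, the "--spec" option, and the file name witness.graphml are hypothetical and do not reproduce the actual CoVeriTeam API.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Artifact:
    """Basic artifact: a typed wrapper around the path that contains it."""
    path: str


class Program(Artifact):
    pass


class Specification(Artifact):
    pass


class Witness(Artifact):
    pass


class VerifierActor:
    """Hypothetical actor type: Program x Specification -> Witness."""

    # (1) Input and output artifact types of this actor type.
    input_types = {"program": Program, "spec": Specification}
    output_types = {"witness": Witness}

    def __init__(self, executable: str, options: Sequence[str]):
        # Both values stand in for entries of a YAML actor definition.
        self.executable = executable
        self.options = list(options)

    def prepare_command(self, inputs: Dict[str, Artifact]) -> List[str]:
        # (2) Check the inputs and build a command line. In the framework,
        # a tool-info module assembles the command; "--spec" is made up.
        for name, expected in self.input_types.items():
            if not isinstance(inputs.get(name), expected):
                raise TypeError(f"input '{name}' must be a {expected.__name__}")
        return [self.executable, *self.options,
                "--spec", inputs["spec"].path, inputs["program"].path]

    def collect_outputs(self, output_dir: str) -> Dict[str, Artifact]:
        # (3) Wrap the files written by the tool as output artifacts.
        return {"witness": Witness(f"{output_dir}/witness.graphml")}
```

Declaring the artifact types on the class is what lets a framework reject an ill-typed input before any tool is started.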

Integration of a new artifact requires creating a new class for the artifact. 
A basic artifact requires a path containing the artifact. Some artifacts support 
special features, for example, a test suite is a mergeable artifact (i.e., two test 
suites for a given input program can be merged into one test suite). 
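
A mergeable artifact could look like the following Python sketch. It is a toy model under assumptions: the class MergeableTestSuite, its fields, and the merge method are hypothetical names chosen for illustration, and test cases are represented as plain strings rather than files.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Artifact:
    """Basic artifact: identified by the path that contains it."""
    path: str


@dataclass(frozen=True)
class MergeableTestSuite(Artifact):
    """Hypothetical mergeable artifact: test cases for one input program."""
    program: str = ""
    test_cases: FrozenSet[str] = frozenset()

    def merge(self, other: "MergeableTestSuite", path: str) -> "MergeableTestSuite":
        # Two suites can only be merged if they target the same program.
        if self.program != other.program:
            raise ValueError("test suites belong to different programs")
        # The merged suite is the union of the test cases of both suites.
        return MergeableTestSuite(path, self.program,
                                  self.test_cases | other.test_cases)
```

The check on the program mirrors the restriction that only test suites for a given input program can be merged.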

Integrating a new tool in the framework requires (1) creating the tool-info
module for it, (2) creating an actor definition for the tool, and (3) providing a
self-contained archive that can be executed on an Ubuntu machine.

At present, CoVeriTeam supports all verifiers and validators that are listed
on the 2021 competition web sites of SV-COMP² and Test-Comp³. One needs
only a few hours to create a new tool-info module and an actor-definition
file. Within a couple of hours we were able to create the actor definitions for
about 40 tools participating in SV-COMP and Test-Comp.


4 Evaluation 


We now present our evaluation of CoVeriTeam. It consists of a few case studies
and insights from experiments that measure the performance overhead.


4.1 Case Studies 


We evaluated CoVeriTeam on four more case studies, as indicated in the fourth
column of Table 1. We now explain two of these case studies using figures for 
compositions. The programs and explanations for all of the case studies are also 
available in our project repository (linked from the last column of Table 1). 


Conditional Testing à la CondTest. Conditional testing [18] allows cooperation
between different test generators (testers) by sharing the details of the


2 https://sv-comp.sosy-lab.org/2021/systems.php
3 https://test-comp.sosy-lab.org/2021/systems.php


Table 1: Examples of cooperative techniques in the literature

Technique                         Year        Reference        Case Study  More Info
Counterexample Checking [38]      2012        Sect. 5
Conditional Model Checking [13]   2012        Sect. 5
Precision Reuse [19]              2013        Sect. 5
Witness Validation [8, 10]        2015, 2016  Figure 4         ✓           Sect. 3.3
Execution-Based Validation [11]   2018        Sect. 5          ✓           More info
Reducer [15]                      2018        Sect. 5          ✓           More info
CoVeriTest [14]                   2019        Sect. 5
CondTest [18]                     2019        Figures 7 and 8  ✓           More info
MetaVal [21]                      2020        Figure 9         ✓           More info


already covered test goals. A conditional tester outputs a condition, in addition to 
the generated test suite, representing the work already done. Then this condition 
is passed as an input to another conditional tester, in addition to the program 
and test specification. This tester can then focus on only the uncovered goals. 


Fig. 7: Design of a conditional tester in CoVeriTeam


Conditional testers can be constructed from off-the-shelf testers [18] with
the help of three tools: a reducer, an extractor, and a joiner. A reducer used
in conditional testing (Program × Specification × Condition → Program) produces
a residual program with the same behavior as the input program with respect
to the remaining test goals. A set of test goals represents the condition. An
extractor (Program × Specification × TestSuite → Condition) extracts the condition
(a set of test goals) covered by the provided test suite.

Figure 7 shows the composition of a conditional tester. First, the reducer
produces the reduced program. The composition here uses a pruning reducer,
which prunes the program according to the covered goals. Second, the tester
generates the test cases. Third, the extractor extracts the goals covered in these
test cases. Fourth, the joiner merges the previously and newly covered goals. The
reducer that we used expects the input program to be in a format containing
certain labels for the purpose of tracking test goals. We therefore place an
instrumentor before the conditional tester; it instruments the test
specification into the program by adding these labels.
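
The four steps of this composition can be sketched in Python as follows. This is a toy model under stated assumptions, not the CoVeriTeam language: a program is abstracted to the set of its test goals, the test specification and the instrumentor are omitted, and all function names (reducer, extractor, joiner, conditional_tester, two_goal_tester) are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Goals = FrozenSet[str]
Suite = Dict[str, Goals]  # test name -> goals covered by that test (toy model)


def reducer(program_goals: Goals, condition: Goals) -> Goals:
    """Pruning reducer (toy): keep only the goals that are not yet covered."""
    return program_goals - condition


def extractor(suite: Suite) -> Goals:
    """Extract the condition: the set of goals covered by the test suite."""
    covered = set()
    for goals in suite.values():
        covered |= goals
    return frozenset(covered)


def joiner(previous: Goals, new: Goals) -> Goals:
    """Merge the previously and the newly covered goals."""
    return previous | new


def conditional_tester(tester: Callable[[Goals], Suite],
                       program_goals: Goals,
                       condition: Goals) -> Tuple[Suite, Goals]:
    residual = reducer(program_goals, condition)    # first: reduce
    suite = tester(residual)                        # second: generate tests
    newly_covered = extractor(suite)                # third: extract goals
    return suite, joiner(condition, newly_covered)  # fourth: join conditions


def two_goal_tester(goals: Goals) -> Suite:
    """Stand-in for an off-the-shelf tester: covers at most two goals."""
    return {f"test_{g}": frozenset({g}) for g in sorted(goals)[:2]}
```

The off-the-shelf tester never sees the condition; it only receives the residual program, which is why any tester can be plugged in.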

The conditional-testing concept can also be used iteratively to generate a test 
suite using a tester based on a verifier [18]. Such a composition uses a verifier as a 
backend and transforms a counterexample generated by the verifier to a test case. 

Fig. 8: Design of a cyclic conditional tester in CoVeriTeam


Figure 8 shows the construction of a cyclic conditional tester. In this case, the
tester itself is a composition of a verifier and a tool, Witness2Test, which generates
test cases based on a witness produced by a verifier. This tester, in composition
with a reducer, an extractor, and a joiner, is our conditional tester. This construction
uses an annotating reducer, which (i) annotates the program with error labels
for the verifier to find a path to and (ii) filters out the already covered goals,
i.e., the condition, from the list of goals to be annotated. We put the conditional
tester in the REPEAT composition to execute it iteratively. The composition tracks
the set 'covered_goals' to detect the fixed point that decides termination of the
iteration. This composition keeps accumulating the test suites generated in
each iteration and finally outputs the union of all the generated test suites (see
Sect. 3.1). As above, an instrumentor is placed before the conditional tester.
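
The iteration to a fixed point can be illustrated with a small Python sketch. It only models the idea under assumptions: the function repeat, the goal sets ALL_GOALS and REACHABLE, and the step function one_goal_step are hypothetical stand-ins, and the verifier-based tester is replaced by a stub that covers one reachable goal per iteration.

```python
from typing import Callable, FrozenSet, Set, Tuple

Goals = FrozenSet[str]


def repeat(step: Callable[[Goals], Tuple[Set[str], Goals]],
           covered_goals: Goals) -> Tuple[Set[str], Goals]:
    """Run `step` until the tracked set `covered_goals` stops changing."""
    all_tests: Set[str] = set()
    while True:
        tests, new_covered = step(covered_goals)
        all_tests |= tests  # accumulate the test suites of all iterations
        if new_covered == covered_goals:  # fixed point reached: terminate
            return all_tests, covered_goals
        covered_goals = new_covered


ALL_GOALS: Goals = frozenset({"g1", "g2", "g3"})
REACHABLE: Goals = frozenset({"g1", "g2"})  # g3 is assumed to be unreachable


def one_goal_step(covered: Goals) -> Tuple[Set[str], Goals]:
    """Stub for the conditional tester: covers one remaining reachable goal."""
    remaining = sorted((ALL_GOALS - covered) & REACHABLE)
    if not remaining:
        return set(), covered
    goal = remaining[0]
    return {f"test_for_{goal}"}, covered | {goal}
```

In this toy run, the loop stops in the iteration in which no new goal is covered, and the result is the union of the tests from all iterations.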


Verification-Based Validation à la MetaVal. MetaVal [21] uses off-the-shelf
verifiers to perform validation tasks. A validator (Program × Specification ×
Verdict × Witness → Verdict × Witness) validates the result produced by a verifier.
MetaVal employs a three-stage process for validation. In the first stage, MetaVal
instruments the input program with the input witness. The instrumented program
(a product of the witness and the original program) is equivalent to the original
program modulo the provided witness. This means that the instrumented program
can be given to an off-the-shelf verifier for verification, and this verification
functions as validation. In the second stage, MetaVal selects the verifier to use
based on the specification. It chooses CPAchecker for reachability, UAutomizer
for integer overflow and termination, and Symbiotic for memory safety⁴. In
the third stage, the instrumented program is fed to the selected verifier along with
the specification for verification. If the verification produces the expected result,
then the result is confirmed and the witness is valid; otherwise it is not.
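
The three stages can be sketched in Python as follows. The mapping from properties to verifiers is taken from the description above; everything else is an assumption made for illustration: the property keys, the function names select_verifier and validate, and the two callables instrument and run_verifier, which stand in for the real instrumentation tool and verifier execution.

```python
from typing import Callable

# Backend selection as described in the text; the dictionary keys are
# hypothetical identifiers for the kinds of specifications.
BACKENDS = {
    "reachability": "CPAchecker",
    "integer-overflow": "UAutomizer",
    "termination": "UAutomizer",
    "memory-safety": "Symbiotic",
}


def select_verifier(property_kind: str) -> str:
    """Stage 2: choose the backend verifier based on the specification."""
    return BACKENDS[property_kind]


def validate(program: str,
             witness: str,
             property_kind: str,
             expected_verdict: str,
             instrument: Callable[[str, str], str],
             run_verifier: Callable[[str, str, str], str]) -> bool:
    """Return True if the verifier confirms the expected verdict."""
    instrumented = instrument(program, witness)  # stage 1: instrumentation
    verifier = select_verifier(property_kind)    # stage 2: selection
    verdict = run_verifier(verifier, instrumented, property_kind)  # stage 3
    return verdict == expected_verdict
```

Passing the instrumentation and the verifier execution as parameters keeps the sketch independent of any concrete tool invocation.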


Fig. 9: Design of MetaVal in CoVeriTeam


Figure 9 shows the construction of MetaVal. First, the selector is executed,
which selects the backend verifier to execute. After this step, the program is
instrumented with the witness, and then the instrumented program is given
to the selected verifier for checking the specification.


4 These were the best performing tools for a property according to SV-COMP results.


4.2 Performance 


CoVeriTeam is a lightweight tool. Its container mode causes an overhead of
around 0.8 s for each actor execution in the composition, and the tool needs
about 44 MB of memory. This means that if we run a tool 10 times in a sequence
in an unprotected shell script and compare this to using the sequence composition
in CoVeriTeam in protected container mode on the same input, the execution
using CoVeriTeam takes about 8 s longer and requires 44 MB more memory. In our
experience, this overhead is not an issue for verification because, in general, the time
taken for verification dominates the total execution time. For short-running,
high-performance needs, the container mode can be switched off. We have conducted
extensive experiments for performance evaluation of CoVeriTeam and point the
reader to the supplementary webpage for this article for more details.
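
The arithmetic behind this estimate can be written down as a short Python sketch. The constants 0.8 s and 44 MB are the figures reported above; the function names and the assumption that the time overhead scales linearly with the number of actor executions, while the memory overhead stays constant, are ours.

```python
from typing import Tuple


def container_overhead(actor_executions: int,
                       seconds_per_actor: float = 0.8,
                       memory_mb: float = 44.0) -> Tuple[float, float]:
    """Return (extra seconds, extra MB) caused by container mode."""
    return actor_executions * seconds_per_actor, memory_mb


def relative_time_overhead(actor_executions: int,
                           tool_seconds_per_run: float) -> float:
    """Fraction of the total run time that is container overhead."""
    extra_seconds, _ = container_overhead(actor_executions)
    total = actor_executions * tool_seconds_per_run + extra_seconds
    return extra_seconds / total
```

For ten actor executions this reproduces the 8 s and 44 MB stated above. With a hypothetical tool run time of 90 s per execution, the overhead stays below 1% of the total time, which illustrates why it rarely matters for verification.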


5 Related Work 


We divide our literature overview into two parts: approaches for tool combinations, 
and cooperative verification approaches. 


Approaches for Tool Combinations. Evidential Tool Bus (ETB) [29, 30, 39]
is a distributed framework for integration of tools based on a variant of
Datalog [1, 24]. It stores the established claims along with the corresponding files and
their versions. This allows the reuse of partial results in regression verification.
ETB orchestrates tool interaction through scripts, queries, and claims.

Our work seems close to ETB at first glance, but on a closer look there
are profound differences. Conceptually, ETB is a query engine that uses claims,
facts, and rules to define and execute a workflow, whereas CoVeriTeam has
been designed to create and execute actors based on tools and their compositions.
We give some semantic meaning, arguably simplistic, to the tools using (i)
wrapper types of artifacts for the files produced and consumed by a tool and
(ii) the notion of verification actors that allows us to see a tool as a function.
This allows us to type-check tool compositions and allow only well-defined
compositions. On the implementation side, we support more tools. This task was
simplified by our design choice to use the integration mechanisms provided by
BenchExec (as used in SV-COMP and Test-Comp). Most well-known automated
verification tools have already been integrated in CoVeriTeam.
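
To illustrate what type-checking a composition can mean when a tool is seen as a function over typed artifacts, consider the following Python sketch. It uses a deliberately simplified rule that we assume only for this example: in a sequence, every input of an actor must be available with the right type, either from the initial inputs or from the outputs of an earlier actor. The names ActorType and check_sequence are hypothetical and not part of CoVeriTeam.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ActorType:
    name: str
    inputs: Dict[str, str]   # artifact name -> artifact type
    outputs: Dict[str, str]  # artifact name -> artifact type


def check_sequence(first: ActorType, second: ActorType,
                   initial_inputs: Dict[str, str]) -> ActorType:
    """Return the type of SEQUENCE(first, second) or raise TypeError."""
    available = dict(initial_inputs)
    for actor in (first, second):
        for name, artifact_type in actor.inputs.items():
            # Each required input must be present with exactly this type.
            if available.get(name) != artifact_type:
                raise TypeError(
                    f"{actor.name} needs '{name}' of type {artifact_type}")
        # Outputs of this actor become available to the next one.
        available.update(actor.outputs)
    return ActorType(f"SEQUENCE({first.name}, {second.name})",
                     dict(initial_inputs), dict(second.outputs))
```

Under this rule, a verifier followed by a validator is accepted, whereas the reverse order is rejected because the validator requires a verdict and a witness that no earlier actor has produced.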

Electronic Tools Integration platform (ETI) [40] was envisioned as a “one stop
shop” for the experimentation and evaluation of tools from the formal-methods
community. It was intended to serve as a tool presentation, tool evaluation,
and benchmarking site. The idea was to allow users to access tools through the
internet without the need to install them. An ETI user is expected to provide an
LTL-based specification, based on which an execution scheme is synthesized.

The key focus of ETI and its incarnations has been remote tool execution
and tool integration over the internet. The tools are viewed as agnostic to their
function. We, in contrast, (i) have tackled local execution concerns and (ii) see a
tool in its function as an actor that consumes and produces certain kinds of
artifacts. The semantic meaning of a tool is given by this role.


Cooperative Verification Approaches. Our work targets developing a framework
to express and execute cooperative verification approaches. In this section
we describe some of these approaches from the literature. We have implemented some
of these combinations in CoVeriTeam, some of which are described in Sect. 4.

A reduction of the input program using the counterexample produced by
a verifier has been discussed [38], where the key idea is to use the
counterexample to provide the variable assignments to the program.

Conditional model checking (CMC) [13] outputs a condition —a summary 
of the knowledge gained— if the model checker fails to produce a verdict. The 
condition allows another model checker to save the effort of looking into already 
explored state space. Reducers [15] can turn any off-the-shelf model checker into 
a conditional model checker. Reducers take a source program and a condition 
and produce a residual program whose paths cover the unverified state space 
(negation of the condition). Conditional testing [18] applies the principle of 
conditional model checking to testing. A conditional tester outputs, in addition 
to the generated test cases, the goals for which test cases have been generated. 

The idea of reusing the knowledge about already done work to reduce the
workload of another tool was also applied to combine program analysis and
testing [25, 31, 35]. One of these approaches [31] is based on conditional model
checking [13]. In this case, the condition is used to construct a residual program,
which is then fed to a test-case generator. Another approach [25] instruments
the program with assumptions and assertions describing the already completed
verification work. Then a testing tool is used to test the assumptions. Program
partitioning [35] first performs the testing and then removes the satisfactorily tested
paths and verifies the rest. CoVeriTest [14], cooperative verifier-based testing, is
a tester based on cooperation between different verification-based test-generation
techniques. CoVeriTest uses conditional model checkers [13] as verifier backends.

Precision reuse [19] is based on the use of abstraction precisions. The precision 
of an abstract domain is a good candidate for cooperation because it is small 
in size, and represents important information, i.e., the level of abstraction at 
which the analysis works. A model checker in addition to producing a verdict 
also produces a file containing information specifying precision, e.g., predicates. 

Model checkers can also produce a witness, in addition to the verdict, as 
a justification of the verdict. These witnesses could be counterexamples for 
violations of a safety property, invariants as a proof of a safety property, a lasso 
for non-termination, a ranking function for termination, etc. These witnesses can 
be used later to help validate the result produced by a verifier [8, 9, 10].

Execution-based result validation [11] uses violation witnesses to generate 
test cases. A violation witness of a safety specification is refined to a test case. 
The test case is then used to validate the result of the verification.


6 Conclusion


Due to the free availability of many excellent verifiers, the time is ripe to view 
verification tools as components. It is necessary to have standardized interfaces, 
in order to define the inputs and outputs of verification components. We have 
identified a set of verification artifacts and verification actors, and a programming 
language for on-demand construction of new, combined verification systems. 
So far, the architectural hierarchy ends mostly at the verifiers: verifiers are 
based on SMT solvers, which are based on SAT solvers, which are based on 
data-structure libraries. CoVeriTeam aims to change this and use verification
artifacts as first-class objects in specifying new verifiers. We show on a few
selected examples how easy it is to construct some verification systems that 
were so far hard-coded using glue code and wrapper scripts. We hope that many 
researchers and practitioners in the verification community find it interesting 
and stimulating to experiment on a high level with verification technology.

Future Work. The approach of CoVeriTeam opens up a whole new area of
possibilities that have yet to be explored. We have identified three key areas
for future work: (i) remote execution of tools, (ii) policy specification
and enforcement, and (iii) more compositions and combinations. CoVeriTeam
provides an interface for a verification tool based on its behavior. A web service
wrapped around CoVeriTeam can be used to delegate execution of an actor,
and hence verification work, to the host of the service. The client for such a service can
be transparently integrated in CoVeriTeam. In fact, we already provide client
integration for a restricted and experimental version of such a service. Also, a user
executing a combination of tools might want to have some restrictions on which
tools should be allowed to execute. For example, a user might want to execute
only those tools that comply with a certain license, or only those tools that are
downloaded from a trusted source. A cooperative verification tool should support
the specification and enforcement of such user policies. Further, we plan to support
more compositions for cooperative verification in CoVeriTeam as we come across
them. Recently, we have been working on a parallel-portfolio composition [17].